The Markdown dialect used when publishing the documentation to the web
site is much more sensitive than Github-flavoured Markdown. In
particular, it requires an empty line before code blocks (unless the
code block starts right after a heading), otherwise the code block
gets formatted in-line with the previous paragraph. Likewise for
bullet-point lists.
* Benchmarks: New SqlBenchmark, add caching & vectorization to some others.
- Introduce a new SqlBenchmark geared towards benchmarking a wide
variety of SQL queries. Rename the old SqlBenchmark to
SqlVsNativeBenchmark.
- Add (optional) caching to SegmentGenerator to enable easier
benchmarking of larger segments.
- Add vectorization to FilteredAggregatorBenchmark and GroupByBenchmark.
* Query vectorization.
This patch includes vectorized timeseries and groupBy engines, as well
as some analogs of your favorite Druid classes:
- VectorCursor is like Cursor. (It comes from StorageAdapter.makeVectorCursor.)
- VectorColumnSelectorFactory is like ColumnSelectorFactory, and it has
methods to create analogs of the column selectors you know and love.
- VectorOffset and ReadableVectorOffset are like Offset and ReadableOffset.
- VectorAggregator is like BufferAggregator.
- VectorValueMatcher is like ValueMatcher.
There are some noticeable differences between vectorized and regular
execution:
- Unlike regular cursors, vector cursors do not understand time
granularity. They expect query engines to handle this on their own,
which a new VectorCursorGranularizer class helps with. This is to
avoid too much batch-splitting and to respect the fact that vector
selectors are somewhat more heavyweight than regular selectors.
- Unlike FilteredOffset, FilteredVectorOffset does not leverage indexes
for filters that might partially support them (like an OR of one
filter that supports indexing and another that doesn't). I'm not sure
that this behavior is desirable anyway (it is potentially too eager)
but, at any rate, it'd be better to harmonize it between the two
classes. Potentially they should both do some different thing that
is smarter than what either of them is doing right now.
- When vector cursors are created by QueryableIndexCursorSequenceBuilder,
they use a morphing binary-then-linear search to find their start and
end rows, rather than linear search.
Limitations in this patch are:
- Only timeseries and groupBy have vectorized engines.
- GroupBy doesn't handle multi-value dimensions yet.
- Vector cursors cannot handle virtual columns or descending order.
- Only some filters have vectorized matchers: "selector", "bound", "in",
"like", "regex", "search", "and", "or", and "not".
- Only some aggregators have vectorized implementations: "count",
"doubleSum", "floatSum", "longSum", "hyperUnique", and "filtered".
- Dimension specs other than "default" don't work yet (no extraction
functions or filtered dimension specs).
Currently, the testing strategy includes adding vectorization-enabled
tests to TimeseriesQueryRunnerTest, GroupByQueryRunnerTest,
GroupByTimeseriesQueryRunnerTest, CalciteQueryTest, and all of the
filtering tests that extend BaseFilterTest. In all of those classes,
there are some test cases that don't support vectorization. They are
marked by special function calls like "cannotVectorize" or "skipVectorize"
that tell the test harness to either expect an exception or to skip the
test case.
Testing should be expanded in the future -- a project in and of itself.
Related to #3011.
* WIP
* Adjustments for unused things.
* Adjust javadocs.
* DimensionDictionarySelector adjustments.
* Add "clone" to BatchIteratorAdapter.
* ValueMatcher javadocs.
* Fix benchmark.
* Fixups post-merge.
* Expect exception on testGroupByWithStringVirtualColumn for IncrementalIndex.
* BloomDimFilterSqlTest: Tag two non-vectorizable tests.
* Minor adjustments.
* Update surefire, bump up Xmx in Travis.
* Some more adjustments.
* Javadoc adjustments
* AggregatorAdapters adjustments.
* Additional comments.
* Remove switching search.
* Only missiles.
* Add inline firehose
To allow users to quickly parsing and schema, add a firehose that reads
data that is inlined in its spec.
* Address review comments
* Remove suppression of sonar warnings
* disable all compression in intermediate segment persists while ingestion
* more changes and build fix
* by default retain existing indexingSpec for intermediate persisted segments
* document indexSpecForIntermediatePersists index tuning config
* fix build issues
* update serde tests
* Fix license check in travis and make it optional
* debug
* fix build
* too loud maven
* move MAVEN_OPTS to top and add comments
* adjust script
* remove mvn option from python script
In Single-Server Quickstart tutorial the overlord and coordinator
is started as one process on port 8081. But in delete data tutorial the kill
task is sent to 8090 port, which fails.
* Add round support for DS-HLL
Since the Cardinality aggregator has a "round" option to round off estimated
values generated from the HyperLogLog algorithm, add the same "round" option to
the DataSketches HLL Sketch module aggregators to be consistent.
* Fix checkstyle errors
* Change HllSketchSqlAggregator to do rounding
* Fix test for standard-compliant null handling mode
* more sql support for expression array functions
* prepend/slice
* doc fixes
* fix imports
* fix tests
* add null numeric expr for proper conversions between ExprEval and Expr and back to ExprEval
* re-arrange
* imports :(
* add append/prepend test
* array support for expression language for multi-value string columns
* fix tests?
* fixes
* more tests
* fixes
* cleanup
* more better, more test
* ignore inspection
* license
* license fix
* inspection
* remove dumb import
* more better
* some comments
* add expr rewrite for arrayfn args for more magic, tests
* test stuff
* more tests
* fix test
* fix test
* castfunc can deal with arrays
* needs more empty array
* more tests, make cast to long array more forgiving
* refactor
* simplify ExprMacro Expr implementations with base classes in core
* oops
* more test
* use Shuttle for Parser.flatten, javadoc, cleanup
* fixes and more tests
* unused import
* fixes
* javadocs, cleanup, refactors
* fix imports
* more javadoc
* more javadoc
* more
* more javadocs, nonnullbydefault, minor refactor
* markdown fix
* adjustments
* more doc
* move initial filter out
* docs
* map empty arg lambda, apply function argument validation
* check function args at parse time instead of eval time
* more immutable
* more more immutable
* clarify grammar
* fix docs
* empty array is string test, we need a way to make arrays better maybe in the future, or define empty arrays as other types..
* Add state and error tracking for seekable stream supervisors
* Fixed nits in docs
* Made inner class static and updated spec test with jackson inject
* Review changes
* Remove redundant config param in supervisor
* Style
* Applied some of Jon's recommendations
* Add transience field
* write test
* implement code review changes except for reconsidering logic of markRunFinishedAndEvaluateHealth()
* remove transience reporting and fix SeekableStreamSupervisorStateManager impl
* move call to stateManager.markRunFinished() from RunNotice to runInternal() for tests
* remove stateHistory because it wasn't adding much value, some fixes, and add more tests
* fix tests
* code review changes and add HTTP health check status
* fix test failure
* refactor to split into a generic SupervisorStateManager and a specific SeekableStreamSupervisorStateManager
* fixup after merge
* code review changes - add additional docs
* cleanup KafkaIndexTaskTest
* add additional documentation for Kinesis indexing
* remove unused throws class
* add s3 authentication method informations
* add druid.s3.fileSessionCredentials related content
* remove authentication parameters to avoid confusion as it is more detailed in S3 Deep Storage page
* streamline s3 docs
* SQL: Allow NULLs in place of optional arguments in many functions.
Also adjust SQL docs to describe how to make time literals using
TIME_PARSE (which is now possible in a nicer way).
* Be less forbidden.
* Upgrade various build and doc links to https.
Where it wasn't possible to upgrade build-time dependencies to https,
I kept http in place but used hardcoded checksums or GPG keys to ensure
that artifacts fetched over http are verified properly.
* Switch to https://apache.org.
* update sys.servers table to show all servers
* update docs
* Fix integration test
* modify test query for batch integration test
* fix case in test queries
* make the server_type lowercase
* Apply suggestions from code review
Co-Authored-By: Himanshu <g.himanshu@gmail.com>
* Fix compilation from git suggestion
* fix unit test
* fix lookup editor to use lookup tiers instead of historical tiers
* use default tier if empty response, fix if configured lookups is null
* fixes
* fix typo
* use relative link to build instructions from top level readme
* add textfile to readme
* formatting
* make README.BINARY plaintext, move LABELS.md to LABELS, README.txt to README
* exclude README.BINARY still
* remove jdk links/recommmendations
* add script to use DRUIDVERSION in textfile README instead of latest, add links to recommended jdk to build.md
* license
* better readme template, links to latest if does not detect an apache release version
* fix
* First set of changes for tDigest histogram
* Add license
* Address code review comments
* Add a doc page for new T-Digest sketch aggregators. Minor code cleanup and comments.
* Remove synchronization from BufferAggregators. Address code review comments
* Fix typo
* add postgresql meta db table schema configuration property (#7137)
If the postgresql db schema changes, you must set the configuration
values.
You do not need to set it if there is no change from the default schema
'public'.
druid.metadata.postgres.dbTableSchema=public
* create postgresql metadb table schema configuration property (#7137)
If the postgresql db schema changes, you must set the configuration
values.
You do not need to set it if there is no change from the default schema
'public'.
druid.metadata.postgres.dbTableSchema=public
check PostgreSQLTablesConfig.java
* modify postgresql readme file. - metadb table schema (#7137)
If the postgresql db schema changes, you must set the configuration
values.
You do not need to set it if there is no change from the default schema
'public'.
druid.metadata.postgres.dbTableSchema=public
check PostgreSQLTablesConfig.java
* Remove SQL experimental banner and other doc adjustments.
Also,
- Adjust the ToC and other docs a bit so SQL and native queries are
presented on more equal footing.
- De-emphasize querying historicals and peons directly in the
native query docs. This is a really niche thing and may have been
confusing to include prominently in the very first paragraph.
- Remove DataSketches and Kafka indexing service from the experimental
features ToC. They are not experimental any longer and were there in
error.
* More notes.
* Slight tweak.
* Remove extra extra word.
* Remove RT node from ToC.
* V1 - improve parallelism of zookeeper based segment change processing
* Create zk nodes in batches. Address code review comments.
Introduce various configs.
* Add documentation for the newly added configs
* Fix test failures
* Fix more test failures
* Remove prinstacktrace statements
* Address code review comments
* Use a single queue
* Address code review comments
Since we have a separate load peon for every historical, just having a single SegmentChangeProcessor
task per historical is enough. This commit also gets rid of the associated config druid.coordinator.loadqueuepeon.curator.numCreateThreads
* Resolve merge conflict
* Fix compilation failure
* Remove batching since we already have a dynamic config maxSegmentsInNodeLoadingQueue that provides that control
* Fix NPE in test
* Remove documentation for configs that are no longer needed
* Address code review comments
* Address more code review comments
* Fix checkstyle issue
* Address code review comments
* Code review comments
* Add back monitor node remove executor
* Cleanup code to isolate null checks and minor refactoring
* Change param name since it conflicts with member variable name
This feature allows Calcite's Bindable interpreter to be bolted on
top of Druid queries and table scans. I think it should be removed for
a few reasons:
1. It is not recommended for production anyway, because it generates
unscalable query plans (e.g. it will plan a join into two table scans
and then try to do the entire join in memory on the broker).
2. It doesn't work with Druid-specific SQL functions, like TIME_FLOOR,
REGEXP_EXTRACT, APPROX_COUNT_DISTINCT, etc.
3. It makes the SQL planning code needlessly complicated.
With SQL coming out of experimental status soon, it's a good opportunity
to remove this feature.
* Contributing Moving-Average Query to open source.
* Fix failing code inspections.
* See if explicit types will invoke the correct comparison function.
* Explicitly remove support for druid.generic.useDefaultValueForNull configuration parameter.
* Update styling and headers for complience.
* Refresh code with latest master changes:
* Remove NullDimensionSelector.
* Apply changes of RequestLogger.
* Apply changes of TimelineServerView.
* Small checkstyle fix.
* Checkstyle fixes.
* Fixing rat errors; Teamcity errors.
* Removing support theta sketches. Will be added back in this pr or a following once DI conflicts with datasketches are resolved.
* Implements some of the review fixes.
* Contributing Moving-Average Query to open source.
* Fix failing code inspections.
* See if explicit types will invoke the correct comparison function.
* Explicitly remove support for druid.generic.useDefaultValueForNull configuration parameter.
* Update styling and headers for complience.
* Refresh code with latest master changes:
* Remove NullDimensionSelector.
* Apply changes of RequestLogger.
* Apply changes of TimelineServerView.
* Small checkstyle fix.
* Checkstyle fixes.
* Fixing rat errors; Teamcity errors.
* Removing support theta sketches. Will be added back in this pr or a following once DI conflicts with datasketches are resolved.
* Implements some of the review fixes.
* More fixes for review.
* More fixes from review.
* MapBasedRow is Unmodifiable. Create new rows instead of modifying existing ones.
* Remove more changes related to datasketches support.
* Refactor BaseAverager startFrom field and add a comment.
* fakeEvents field: Refactor initialization and add comment.
* Rename parameters (tiny change).
* Fix variable name typo in test (JAN_4).
* Fix styling of non camelCase fields.
* Fix Preconditions.checkArgument for cycleSize.
* Add more documentation to RowBucketIterable and other classes.
* key/value comment on in MovingAverageIterable.
* Fix anonymous makeColumnValueSelector returning null.
* Replace IdentityYieldingAccumolator with Yielders.each().
* * internalNext() should return null instead of throwing exception.
* Remove unused variables/prarameters.
* Harden MovingAverageIterableTest (Switch anyOf to exact match).
* Change internalNext() from recursion to iteration; Simplify next() and hasNext().
* Remove unused imports.
* Address review comments.
* Rename fakeEvents to emptyEvents.
* Remove redundant parameter key from computeMovingAverage.
* Check yielder as well in RowBucketIterable#hasNext()
* Fix javadoc.
* Add reload by interval API
Implements the reload proposal of #7439
Added tests and updated docs
* PR updates
* Only build timeline with required segments
Use 404 with message when a segmentId is not found
Fix typo in doc
Return number of segments modified.
* Fix checkstyle errors
* Replace String.format with StringUtils.format
* Remove return value
* Expand timeline to segments that overlap for intervals
Restrict update call to only segments that need updating.
* Only add overlapping enabled segments to the timeline
* Some renames for clarity
Added comments
* Don't rely on cached poll data
Only fetch required information from DB
* Match error style
* Merge and cleanup doc
* Fix String.format call
* Add unit tests
* Fix unit tests that check for overshadowing
* Add api to drop data by interval
* update to address comments
* unused imports
* PR comments + add tests in SQLMetadataSegmentManagerTest
* update tests and docs
* Implement round function in druid-sql
* Return value according to the type of argument
* Fix codes for abnoraml inputs, updated math-expr.md
* Fix assert text
* Fix error messages and refactor codes
* Fix compile error, update sql.md, refactor codes and format tests
* Enhnace the HttpFirehose to work with both insecure URIs and URIs requiring basic authentication
* Improve security of enhanced HttpFirehoseFactory by not logging auth credentials
* Fix checkstyle failure in HttpFirehoseFactory.java
* Update docs and fix TeamCity build with required noinspection
* Indentation cleanup and logic modification for HttpFirehose object stream
* Remove default Empty string password provider in http firehose
* Add JavaDoc for MixIn describing its intended use
* Reverting documentation notation for json code to be inline with rest of doc
* Improve instantiation of ObjectMappers that require MixIn for redacting password from task logs
* Add comment to clarify fully qualified references of Objects in SQLMetadataStorageActionHandler
* Update scan query runner factory to accept SpecificSegmentSpec
* nit
* Sorry travis
* Improve logging and fix doc
* Bug fix
* Friendlier error msgs and tests to cover bug
* Address Gian's comments
* Fix doc
* Added tests for empty and null column list
* Style
* Fix checking wrong order (looking at query param when it should be
looking at the null-handled order)
* Add test case for null order
* Fix ScanQueryRunnerTest
* Forbidden APIs fixed
Removes the coordinator sanity check that prevents it from dropping all
segments. It's useful to get rid of this, since the behavior is
unintuitive for dev/testing clusters where users might regularly want
to drop all their data to get back to a clean slate.
But the sanity check was there for a reason: to prevent a race condition
where the coordinator might drop all segments if it ran before the
first metadata store poll finished. This patch addresses that concern
differently, by allowing methods in MetadataSegmentManager to return
null if a poll has not happened yet, and canceling coordinator runs
in that case.
This patch also makes the "dataSources" reference in
SQLMetadataSegmentManager volatile. I'm not sure why it wasn't volatile
before, but it seems necessary to me: it's not final, and it's dereferenced
from multiple threads without synchronization.