* Fix lookup docs
* Fix spelling
* Apply suggestions from code review
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
---------
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Main change: clarify that the "default value" for casts only applies if
druid.generic.useDefaultValueForNull = true.
Secondary change: adjust a bunch of wording from future to present tense.
* initial commit for jdbc tutorial
(cherry picked from commit 04c4adad71e5436b76c3425fe369df03aaaf0acb)
* add commentary
* address comments from charles
* add query context to example
* fix typo
* add links
* Apply suggestions from code review
Co-authored-by: Frank Chen <frankchen@apache.org>
* fix datatype
* address feedback
* add parameterize to spelling file. the past tense version was already there
Co-authored-by: Frank Chen <frankchen@apache.org>
* Fix typo
* Fix some spacing
* Add missing fields
* Cleanup table spacing
* Remove durable storage docs again
Thanks Brian for pointing out previous discussions.
* Update docs/multi-stage-query/reference.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Mark codes as code
* And even more codes as code
* Another set of spaces
* Combine `ColumnTypeNotSupported`
Thanks Karan.
* More whitespaces and typos
* Add spelling and fix links
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Always return sketches from DS_HLL, DS_THETA, DS_QUANTILES_SKETCH.
These aggregation functions are documented as creating sketches. However,
they are planned into native aggregators that include finalization logic
to convert the sketch to a number of some sort. This creates an
inconsistency: the functions sometimes return sketches, and sometimes
return numbers, depending on where they lie in the native query plan.
This patch changes these SQL aggregators to _never_ finalize, by using
the "shouldFinalize" feature of the native aggregators. It already
existed for theta sketches. This patch adds the feature for hll and
quantiles sketches.
As to impact, Druid finalizes aggregators in two cases:
- When they appear in the outer level of a query (not a subquery).
- When they are used as input to an expression or finalizing-field-access
post-aggregator (not any other kind of post-aggregator).
With this patch, the functions will no longer be finalized in these cases.
The second item is not likely to matter much. The SQL functions all declare
return type OTHER, which would be usable as an input to any other function
that makes sense and that would be planned into an expression.
So, the main effect of this patch is the first item. To provide backwards
compatibility with anyone that was depending on the old behavior, the
patch adds a "sqlFinalizeOuterSketches" query context parameter that
restores the old behavior.
Other changes:
1) Move various argument-checking logic from runtime to planning time in
DoublesSketchListArgBaseOperatorConversion, by adding an OperandTypeChecker.
2) Add various JsonIgnores to the sketches to simplify their JSON representations.
3) Allow chaining of ExpressionPostAggregators and other PostAggregators
in the SQL layer.
4) Avoid unnecessary FieldAccessPostAggregator wrapping in the SQL layer,
now that expressions can operate on complex inputs.
5) Adjust return type to thetaSketch (instead of OTHER) in
ThetaSketchSetBaseOperatorConversion.
* Fix benchmark class.
* Fix compilation error.
* Fix ThetaSketchSqlAggregatorTest.
* Hopefully fix ITAutoCompactionTest.
* Adjustment to ITAutoCompactionTest.
* Clarified the behaviour of COUNT(DISTINCT column) on multi-value columns
* Update docs/querying/sql-aggregations.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Vadim Ogievetsky <vadimon@gmail.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Various documentation updates.
1) Split out "data management" from "ingestion". Break it into thematic pages.
2) Move "SQL-based ingestion" into the Ingestion category. Adjust content so
all conceptual content is in concepts.md and all syntax content is in reference.md.
Shorten the known issues page to the most interesting ones.
3) Add SQL-based ingestion to the ingestion method comparison page. Remove the
index task, since index_parallel is just as good when maxNumConcurrentSubTasks: 1.
4) Rename various mentions of "Druid console" to "web console".
5) Add additional information to ingestion/partitioning.md.
6) Remove a mention of Tranquility.
7) Remove a note about upgrading to Druid 0.10.1.
8) Remove no-longer-relevant task types from ingestion/tasks.md.
9) Move ingestion/native-batch-firehose.md to the hidden section. It was previously deprecated.
10) Move ingestion/native-batch-simple-task.md to the hidden section. It is still linked in some
places, but it isn't very useful compared to index_parallel, so it shouldn't take up space
in the sidebar.
11) Make all br tags self-closing.
12) Certain other cosmetic changes.
13) Update to node-sass 7.
* make travis use node12 for docs
Co-authored-by: Vadim Ogievetsky <vadim@ogievetsky.com>
Co-authored-by: Clint Wylie <cjwylie@gmail.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: brian.le <brian.le@imply.io>
* Adjust "in" filter null behavior to match "selector".
Now, both of them match numeric nulls if constructed with a "null" value.
This is consistent as far as native execution goes, but doesn't match
the behavior of SQL = and IN. So, to address that, this patch also
updates the docs to clarify that the native filters do match nulls.
This patch also updates the SQL docs to describe how Boolean logic is
handled in addition to how NULL values are handled.
Fixes#12856.
* Fix test.
* Automatic sizing for GroupBy dictionary sizes.
Merging and selector dictionary sizes currently both default to 100MB.
This is not optimal, because it can lead to OOM on small servers and
insufficient resource utilization on larger servers. It also invites
end users to try to tune it when queries run out of dictionary space,
which can make things worse if the end user sets it to too high.
So, this patch:
- Adds automatic tuning for selector and merge dictionaries. Selectors
use up to 15% of the heap and merge buffers use up to 30% of the heap
(aggregate across all queries).
- Updates out-of-memory error messages to emphasize enabling disk
spilling vs. increasing memory parameters. With the memory parameters
automatically sized, it is more likely that an end user will get
benefit from enabling disk spilling.
- Removes the query context parameters that allow lowering of configured
dictionary sizes. These complicate the calculation, and I don't see a
reasonable use case for them.
* Adjust tests.
* Review adjustments.
* Additional comment.
* Remove unused import.
* IMPLY-12348: Updated description of UNION ALL
* Update docs/querying/sql.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/sql.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update sql.md
* Update docs/querying/sql.md
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Co-authored-by: Victoria Lim <vtlim@users.noreply.github.com>
* Add TIME_IN_INTERVAL SQL operator.
The operator is implemented as a convertlet rather than an
OperatorConversion, because this allows it to be equivalent to using
the >= and < operators directly.
* SqlParserPos cannot be null here.
* Remove unused import.
* Doc updates.
* Add words to dictionary.
* Small addition to Multitenancy considerations doc
* Update docs/querying/multitenancy.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update multitenancy.md
Edit suggested by @kfaraz
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* ConcurrentGrouper: Add option to always slice up merge buffers thread-locally.
Normally, the ConcurrentGrouper shares merge buffers across processing
threads until spilling starts, and then switches to a thread-local model.
This minimizes memory use and reduces likelihood of spilling, which is
good, but it creates thread contention. The new mergeThreadLocal option
causes a query to start in thread-local mode immediately, and allows us
to experiment with the relative performance of the two modes.
* Fix grammar in docs.
* Fix race in ConcurrentGrouper.
* Fix issue with timeouts.
* Remove unused import.
* Add "tradeoff" to dictionary.
* SQL: Add is_active to sys.segments, update examples and docs.
is_active is short for:
(is_published = 1 AND is_overshadowed = 0) OR is_realtime = 1
It's important because this represents "all the segments that should
be queryable, whether or not they actually are right now". Most of the
time, this is the set of segments that people will want to look at.
The web console already adds this filter to a lot of its queries,
proving its usefulness.
This patch also reworks the caveat at the bottom of the sys.segments
section, so its information is mixed into the description of each result
field. This should make it more likely for people to see the information.
* Wording updates.
* Adjustments for spellcheck.
* Adjust IT.
In the majority of cases, this improves performance.
There's only one case I'm aware of where this may be a net negative: for time_floor(__time, <period>) where there are many repeated __time values. In nonvectorized processing, SingleLongInputCachingExpressionColumnValueSelector implements an optimization to avoid computing the time_floor function on every row. There is no such optimization in vectorized processing.
IMO, we shouldn't mention this in the docs. Rationale: It's too fiddly of a thing: it's not guaranteed that nonvectorized processing will be faster due to the optimization, because it would have to overcome the inherent speed advantage of vectorization. So it'd always require testing to determine the best setting for a specific dataset. It would be bad if users disabled vectorization thinking it would speed up their queries, and it actually slowed them down. And even if users do their own testing, at some point in the future we'll implement the optimization for vectorized processing too, and it's likely that users that explicitly disabled vectorization will continue to have it disabled. I'd like to avoid this outcome by encouraging all users to enable vectorization at all times. Really advanced users would be following development activity anyway, and can read this issue
setting thread names takes a measurable amount of time in the case where segment scans are very quick. In high-QPS testing we found a slight performance boost from turning off processing thread renaming. This option makes that possible.
* Add feature flag for sql planning of TimeBoundary queries
* fixup! Add feature flag for sql planning of TimeBoundary queries
* Add documentation for enableTimeBoundaryPlanning
* fixup! Add documentation for enableTimeBoundaryPlanning
* Reduce allocations due to Jackson serialization.
This patch attacks two sources of allocations during Jackson
serialization:
1) ObjectMapper.writeValue and JsonGenerator.writeObject create a new
DefaultSerializerProvider instance for each call. It has lots of
fields and creates pressure on the garbage collector. So, this patch
adds helper functions in JacksonUtils that enable reuse of
SerializerProvider objects and updates various call sites to make
use of this.
2) GroupByQueryToolChest copies the ObjectMapper for every query to
install a special module that supports backwards compatibility with
map-based rows. This isn't needed if resultAsArray is set and
all servers are running Druid 0.16.0 or later. This release was a
while ago. So, this patch disables backwards compatibility by default,
which eliminates the need to copy the heavyweight ObjectMapper. The
patch also introduces a configuration option that allows admins to
explicitly enable backwards compatibility.
* Add test.
* Update additional call sites and add to forbidden APIs.
* Update caching.md
Knowledge from https://the-asf.slack.com/archives/CJ8D1JTB8/p1597781107153900
Update caching.md
A few additional updates OTBO https://the-asf.slack.com/archives/CJ8D1JTB8/p1608669046041300
* Update caching.md
Typos
* Amendments on the segment cache
Significant updates on content around the segment cache, pull process, and in-memory cache
* Update docs/design/historical.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/design/historical.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/design/historical.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/design/historical.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/design/historical.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/design/historical.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/caching.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/caching.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/caching.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/caching.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/design/historical.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/design/historical.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/design/historical.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/operations/basic-cluster-tuning.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/caching.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/caching.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/caching.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/caching.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/querying/caching.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update docs/operations/basic-cluster-tuning.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Update basic-cluster-tuning.md
typo
* Update docs/querying/caching.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Whole-query caching update
Made more succinct and removed specific config to change.
* Update docs/design/historical.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Added Calcites InQueryThreshold as a query context parameter. Setting this parameter appropriately reduces the time taken for queries with large number of values in their IN conditions.