* docs: add juptyer API tutorial for API and jupyter tutorial index (#3)
(cherry picked from commit aeb8d9e3390fa26d9c533dce0862295b80c58583)
* update prereqs and fix jupyterlab name
* Removing notebook since 13345 has it
13345 should be merged first
* update contributing instructions
* docs: link to the Druid SQL tutorial
* Add link to partitioning
* fix merge conflict
* Saving
* Update docs/tutorials/tutorial-jupyter-index.md
* Remove partitioning
---------
Co-authored-by: 317brian <53799971+317brian@users.noreply.github.com>
Co-authored-by: brian.le <brian.le@imply.io>
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* Better sidecar support
* remove un-thrown exception from test
* Druid you are such a stickler about spelling :)
* Only require the primaryContainerName, no need to exclude containers
Without perl 5 I was unable to start druid using the instructions in the quickstart guide. I'm not certain what versions it might require, but the one that I got working was perl 5
> This is perl 5, version 36, subversion 0 (v5.36.0) built for x86_64-linux-thread-multi
### Description
This change adds a new config property `druid.sql.planner.operatorConversion.denyList`, which allows a user to specify
any operator conversions that they wish to disallow. A user may want to do this for a number of reasons, including security concerns. The default value of this property is the empty list `[]`, which does not disallow any operator conversions.
An example usage of this property is `druid.sql.planner.operatorConversion.denyList=["extern"]`, which disallows the usage of the `extern` operator conversion. If the property is configured this way, and a user of the Druid cluster tries to submit a query that uses the `extern` function, such as the example given [here](https://druid.apache.org/docs/latest/multi-stage-query/examples.html#insert-with-no-rollup), a response with http response code `400` is returned with en error body similar to the following:
```
{
"taskId": "4ec5b0b6-fa9b-4c3a-827d-2308294e9985",
"state": "FAILED",
"error": {
"error": "Plan validation failed",
"errorMessage": "org.apache.calcite.runtime.CalciteContextException: From line 28, column 5 to line 32, column 5: No match found for function signature EXTERN(<CHARACTER>, <CHARACTER>, <CHARACTER>)",
"errorClass": "org.apache.calcite.tools.ValidationException",
"host": null
}
}
```
* Allow users to add additional metadata to ingestion metrics
When submitting an ingestion spec, users may pass a map of metadata
in the ingestion spec config that will be added to ingestion metrics.
This will make it possible for operators to tag metrics with other
metadata that doesn't necessarily line up with the existing tags
like taskId.
Druid clusters that ingest these metrics can take advantage of the
nested data columns feature to process this additional metadata.
* rename to tags
* docs
* tests
* fix test
* make code cov happy
* checkstyle
Merging regardless of nit since topic is in better shape.
* refresh the update data tutorial
* Apply suggestions from code review
Co-authored-by: Jill Osborne <jill.osborne@imply.io>
---------
Co-authored-by: Jill Osborne <jill.osborne@imply.io>
* Fix lookup docs
* Fix spelling
* Apply suggestions from code review
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
---------
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Add a new API to return the history of changes to automatic compaction config history to make it easy for users to see what changes have been made to their auto-compaction config.
The API is scoped per dataSource to allow users to triage issues with an individual dataSource. The API responds with a list of configs when there is a change to either the settings that impact all auto-compaction configs on a cluster or the dataSource in question.
Much improved table functions
* Revises properties, definitions in the catalog
* Adds a "table function" abstraction to model such functions
* Specific functions for HTTP, inline, local and S3.
* Extended SQL types in the catalog
* Restructure external table definitions to use table functions
* EXTEND syntax for Druid's extern table function
* Support for array-valued table function parameters
* Support for array-valued SQL query parameters
* Much new documentation
* Kinesis: More robust default fetch settings.
1) Default recordsPerFetch and recordBufferSize based on available memory
rather than using hardcoded numbers. For this, we need an estimate
of record size. Use 10 KB for regular records and 1 MB for aggregated
records. With 1 GB heaps, 2 processors per task, and nonaggregated
records, recordBufferSize comes out to the same as the old
default (10000), and recordsPerFetch comes out slightly lower (1250
instead of 4000).
2) Default maxRecordsPerPoll based on whether records are aggregated
or not (100 if not aggregated, 1 if aggregated). Prior default was 100.
3) Default fetchThreads based on processors divided by task count on
Indexers, rather than overall processor count.
4) Additionally clean up the serialized JSON a bit by adding various
JsonInclude annotations.
* Updates for tests.
* Additional important verify.
* reword single server page
* fix typo
* Update docs/operations/single-server.md
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
* spelling
Co-authored-by: Charles Smith <techdocsmith@gmail.com>
Changes:
- Remove specification of a Druid version in the quickstart, because the previous step
instructs downloading the latest version anyway.
- Mention usage of memory parameter in the quickstart
Main change: clarify that the "default value" for casts only applies if
druid.generic.useDefaultValueForNull = true.
Secondary change: adjust a bunch of wording from future to present tense.
Follow up to #13520
Bytes processed are currently tracked for intermediate stages in MSQ ingestion.
This patch adds the capability to track the bytes processed by an MSQ controller
task while reading from an external input source or a segment source.
Changes:
- Track `processedBytes` for every `InputSource` read in `ExternalInputSliceReader`
- Update `ChannelCounters` with the above obtained `processedBytes` when incrementing
the input file count.
- Update task report structure in docs
The total input processed bytes can be obtained by summing the `processedBytes` as follows:
totalBytes = 0
for every root stage (i.e. a stage which does not have another stage as an input):
for every worker in that stage:
for every input channel: (i.e. channels with prefix "input", e.g. "input0", "input1", etc.)
totalBytes += processedBytes
* Add validation checks to worker chat handler apis
* Merge things and polishing the error messages.
* Minor error message change
* Fixing race and adding some tests
* Fixing controller fetching stats from wrong workers.
Fixing race
Changing default mode to Parallel
Adding logging.
Fixing exceptions not propagated properly.
* Changing to kernel worker count
* Added a better logic to figure out assigned worker for a stage.
* Nits
* Moving to existing kernel methods
* Adding more coverage
Co-authored-by: cryptoe <karankumar1100@gmail.com>
* Zero-copy local deep storage.
This is useful for local deep storage, since it reduces disk usage and
makes Historicals able to load segments instantaneously.
Two changes:
1) Introduce "druid.storage.zip" parameter for local storage, which defaults
to false. This changes default behavior from writing an index.zip to writing
a regular directory. This is safe to do even during a rolling update, because
the older code actually already handled unzipped directories being present
on local deep storage.
2) In LocalDataSegmentPuller and LocalDataSegmentPusher, use hard links
instead of copies when possible. (Generally this is possible when the
source and destination directory are on the same filesystem.)