Fixes KafkaEmitter not emitting queryType for a native query. The Event to JSON serialization was extracted to the external class: EventToJsonSerializer. This was done to simplify the testing logic for the serialization as well as extract the responsibility of serialization to the separate class.
The logic builds ObjectNode incrementally based on the event .toMap method. Parsing each entry individually ensures that the Jackson polymorphic annotations are respected. Not respecting these annotation caused the missing of the queryType from output event.
The Avro parsing code leaks some "object" representations.
We need to convert them into Maps/Lists so that other code
can understand and expect good things. Previously, these
objects were handled with .toString(), but that's not a
good contract in terms of how to work with objects.
The method wasn't following its contract, leading to pollution of the
overall planner context, when really we just want to create a new
context for a specific query.
Added link to the relevant section of the Basic Cluster Tuning page on each process page.
This is in order to improve access to this information, which is not easy to find through search or nav.
* Fix race in TaskQueue.notifyStatus.
It was possible for manageInternal to relaunch a task while it was
being cleaned up, due to a race that happens when notifyStatus is
called to clean up a successful task:
1) In a critical section, notifyStatus removes the task from "tasks".
2) Outside a critical section, notifyStatus calls taskRunner.shutdown
to let the task runner know it can clear out its data structures.
3) In a critical section, syncFromStorage adds the task back to "tasks",
because it is still present in metadata storage.
4) In a critical section, manageInternalCritical notices that the task
is in "tasks" and is not running in the taskRunner, so it launches
it again.
5) In a (different) critical section, notifyStatus updates the metadata
store to set the task status to SUCCESS.
6) The task continues running even though it should not be.
The possibility for this race was introduced in #12099, which shrunk
the critical section in notifyStatus. Prior to that patch, a single
critical section encompassed (1), (2), and (5), so the ordering above
was not possible.
This patch does the following:
1) Fixes the race by adding a recentlyCompletedTasks set that prevents
the main management loop from doing anything with tasks that are
currently being cleaned up.
2) Switches the order of the critical sections in notifyStatus, so
metadata store updates happen first. This is useful in case of
server failures: it ensures that if the Overlord fails in the midst
of notifyStatus, then completed-task statuses are still available in
ZK or on MMs for the next Overlord. (Those are cleaned up by
taskRunner.shutdown, which formerly ran first.) This isn't related
to the race described above, but is fixed opportunistically as part
of the same patch.
3) Changes the "tasks" list to a map. Many operations require retrieval
or removal of individual tasks; those are now O(1) instead of O(N)
in the number of running tasks.
4) Changes various log messages to use task ID instead of full task
payload, to make the logs more readable.
* Fix format string.
* Update comment.
* SQL: Morph QueryMakerFactory into SqlEngine.
Groundwork for introducing an indexing-service-task-based SQL engine
under the umbrella of #12262. Also includes some other changes related
to improving error behavior.
Main changes:
1) Elevate the QueryMakerFactory interface (an extension point that allows
customization of how queries are made) into SqlEngine. SQL engines
can influence planner behavior through EngineFeatures, and can fully
control the mechanics of query execution using QueryMakers.
2) Remove the server-wide QueryMakerFactory choice, in favor of the choice
being made by the SQL entrypoint. The indexing-service-task-based
SQL engine would be associated with its own entrypoint, like
/druid/v2/sql/task.
Other changes:
1) Adjust DruidPlanner to try either DRUID or BINDABLE convention based
on analysis of the planned rels; never try both. In particular, we
no longer try BINDABLE when DRUID fails. This simplifies the logic
and improves error messages.
2) Adjust error message "Cannot build plan for query" to omit the SQL
query text. Useful because the text can be quite long, which makes it
easy to miss the text about the problem.
3) Add a feature to block context parameters used internally by the SQL
planner from being supplied by end users.
4) Add a feature to enable adding row signature to the context for
Scan queries. This is useful in building the task-based engine.
5) Add saffron.properties file that turns off sets and graphviz dumps
in "cannot plan" errors. Significantly reduces log spam on the Broker.
* Fixes from CI.
* Changes from review.
* Can vectorize, now that join-to-filter is on by default.
* Checkstyle! And variable renames!
* Remove throws from test.
* Error handling improvements for frame channels.
Two changes:
1) Send errors down in-memory channels (BlockingQueueFrameChannel) on
failure. This ensures that in situations where a chain of processors
has been set up on a single machine, all processors see the root
cause error. In particular, this means the final processor in the
chain reports the root cause error, which ensures that someone with
a handle to the final processor will get the proper error.
2) Update FrameFileHttpResponseHandler to expect that the final fetch,
rather than being simply empty, is also empty with a special header.
This ensures that the handler is able to tell the difference between
an empty fetch due to being at EOF, and an empty fetch due to a
truncated HTTP response (after the 200 OK and headers are sent down,
but before any content appears).
* Fix tests, imports.
* Checkstyle!
* Refactor SqlLifecycle into statement classes
Create direct & prepared statements
Remove redundant exceptions from tests
Tidy up Calcite query tests
Make PlannerConfig more testable
* Build fixes
* Added builder to SqlQueryPlus
* Moved Calcites system properties to saffron.properties
* Build fix
* Resolve merge conflict
* Fix IntelliJ inspection issue
* Revisions from reviews
Backed out a revision to Calcite tests that didn't work out as planned
* Build fix
* Fixed spelling errors
* Fixed failed test
Prepare now enforces security; before it did not.
* Rebase and fix IntelliJ inspections issue
* Clean up exception handling
* Fix handling of JDBC auth errors
* Build fix
* More tweaks to security messages
* changes to run examples on macos where cd command returns current directory
* Update examples/bin/run-druid
Co-authored-by: Frank Chen <frankchen@apache.org>
* merging
* sending output of cd command to /dev/null
Co-authored-by: Frank Chen <frankchen@apache.org>
This is used to control access to the EXTERN function, which allows
reading external data in SQL. The EXTERN function is not usable in
production as of today, but it is used by the task-based SQL engine
contemplated in #12262.
* Introduce defaultOnDiskStorage config for groupBy
* add debug log to groupby query config
* Apply config change suggestion from review
* Remove accidental new lines
* update default value of new default disk storage config
* update debug log to have more descriptive text
* Make maxOnDiskStorage and defaultOnDiskStorage HumanRedadableBytes
* improve test coverage
* Provide default implementation to new default method on advice of reviewer
In the current druid code base, we have the interface DataSegmentPusher which allows us to push segments to the appropriate deep storage without the extension being worried about the semantics of how to push too deep storage.
While working on #12262, whose some part of the code will go as an extension, I realized that we do not have an interface that allows us to do basic "write, get, delete, deleteAll" operations on the appropriate deep storage without let's say pulling the s3-storage-extension dependency in the custom extension.
Hence, the idea of StorageConnector was born where the storage connector sits inside the druid core so all extensions have access to it.
Each deep storage implementation, for eg s3, GCS, will implement this interface.
Now with some Jackson magic, we bind the implementation of the correct deep storage implementation on runtime using a type variable.
The Netty pipeline set up by the client can deliver multiple exceptions,
and can deliver chunks even after delivering exceptions. This makes it
difficult to implement HttpResponseHandlers. Looking at existing handler
implementations, I do not see attempts to handle this case, so it's also
a potential source of bugs.
This patch updates the client to track whether an exception was
encountered, and if so, to not call any additional methods on the handler
after exceptionCaught. It also harmonizes exception handling between
exceptionCaught and channelDisconnected.
Refactors the DruidSchema and DruidTable abstractions to prepare for the Druid Catalog.
As we add the catalog, we’ll want to combine physical segment metadata information with “hints” provided by the catalog. This is best done if we tidy up the existing code to more clearly separate responsibilities.
This PR is purely a refactoring move: no functionality changed. There is no difference to user functionality or external APIs. Functionality changes will come later as we add the catalog itself.
DruidSchema
In the present code, DruidSchema does three tasks:
Holds the segment metadata cache
Interfaces with an external schema manager
Acts as a schema to Calcite
This PR splits those responsibilities.
DruidSchema holds the Calcite schema for the druid namespace, combining information fro the segment metadata cache, from the external schema manager and (later) from the catalog.
SegmentMetadataCache holds the segment metadata cache formerly in DruidSchema.
DruidTable
The present DruidTable class is a bit of a kitchen sink: it holds all the various kinds of tables which Druid supports, and uses if-statements to handle behavior that differs between types. Yet, any given DruidTable will handle only one such table type. To more clearly model the actual table types, we split DruidTable into several classes:
DruidTable becomes an abstract base class to hold Druid-specific methods.
DatasourceTable represents a datasource.
ExternalTable represents an external table, such as from EXTERN or (later) from the catalog.
InlineTable represents the internal case in which we attach data directly to a table.
LookupTable represents Druid’s lookup table mechanism.
The new subclasses are more focused: they can be selective about the data they hold and the various predicates since they represent just one table type. This will be important as the catalog information will differ depending on table type and the new structure makes adding that logic cleaner.
DatasourceMetadata
Previously, the DruidSchema segment cache would work with DruidTable objects. With the catalog, we need a layer between the segment metadata and the table as presented to Calcite. To fix this, the new SegmentMetadataCache class uses a new DatasourceMetadata class as its cache entry to hold only the “physical” segment metadata information: it is up to the DruidTable to combine this with the catalog information in a later PR.
More Efficient Table Resolution
Calcite provides a convenient base class for schema objects: AbstractSchema. However, this class is a bit too convenient: all we have to do is provide a map of tables and Calcite does the rest. This means that, to resolve any single datasource, say, foo, we need to cache segment metadata, external schema information, and catalog information for all tables. Just so Calcite can do a map lookup.
There is nothing special about AbstractSchema. We can handle table lookups ourselves. The new AbstractTableSchema does this. In fact, all the rest of Calcite wants is to resolve individual tables by name, and, for commands we don’t use, to provide a list of table names.
DruidSchema now extends AbstractTableSchema. SegmentMetadataCache resolves individual tables (and provides table names.)
DruidSchemaManager
DruidSchemaManager provides a way to specify table schemas externally. In this sense, it is similar to the catalog, but only for datasources. It originally followed the AbstractSchema pattern: it implements provide a map of tables. This PR provides new optional methods for the table lookup and table names operations. The default implementations work the same way that AbstractSchema works: we get the entire map and pick out the information we need. Extensions that use this API should be revised to support the individual operations instead. Druid code no longer calls the original getTables() method.
The PR has one breaking change: since the DruidSchemaManager map is read-only to the rest of Druid, we should return a Map, not a ConcurrentMap.
* change kafka lookups module to not commit offsets
The current behaviour of the Kafka lookup extractor is to not commit
offsets by assigning a unique ID to the consumer group and setting
auto.offset.reset to earliest. This does the job but also pollutes the
Kafka broker with a bunch of "ghost" consumer groups that will never again be
used.
To fix this, we now set enable.auto.commit to false, which prevents the
ghost consumer groups being created in the first place.
* update docs to include new enable.auto.commit setting behaviour
* update kafka-lookup-extractor documentation
Provide some additional detail on functionality and configuration.
Hopefully this will make it clearer how the extractor works for
developers who aren't so familiar with Kafka.
* add comments better explaining the logic of the code
* add spelling exceptions for kafka lookup docs
* remove kafka lookup records from factory when record tombstoned
* update kafka lookup docs to include tombstone behaviour
* change test wait time down to 10ms
Co-authored-by: David Palmer <david.palmer@adscale.co.nz>