FastVectorHighlighter uses the top-level reader to rewrite queries against, which
it gets via an IndexSearcher field on HitContext. However, we can already access
this top-level reader via HitContext's existing LeafReaderContext field.
This commit removes the unnecessary field and constructor parameter, and
changes the implementation of topLevelReader to go via ReaderUtils and
the leaf reader context.
This PR adds support for the 'fields' option in the following places:
* Anytime `inner_hits` is used, for both fetching nested/ child docs and field collapsing
* The `top_hits` aggregation
Addresses #61949.
Today some uncaught shard failures such as RejectedExecutionException skips the release of shard context
and let subsequent scroll requests access the same shard context again. Depending on how the other shards advanced,
this behavior can lead to missing data since scrolls always move forward.
In order to avoid hidden data loss, this commit ensures that we always release the context of shard search scroll requests whenever a failure
occurs locally. The shard search context will no longer exist in subsequent scroll requests which will lead to consistent shard failures
in the responses.
This change also modifies the retry tests of the reindex feature. Reindex retries scroll search request that contains a shard failure and
move on whenever the failure disappears. That is not compatible with how scrolls work and can lead to missing data as explained above.
That means that reindex will now report scroll failures when search rejection happen during the operation instead of skipping document
silently.
Finally this change removes an old TODO that was fulfilled with #61062.
This commit introduces a new API that manages point-in-times in x-pack
basic. Elasticsearch pit (point in time) is a lightweight view into the
state of the data as it existed when initiated. A search request by
default executes against the most recent point in time. In some cases,
it is preferred to perform multiple search requests using the same point
in time. For example, if refreshes happen between search_after requests,
then the results of those requests might not be consistent as changes
happening between searches are only visible to the more recent point in
time.
A point in time must be opened before being used in search requests. The
`keep_alive` parameter tells Elasticsearch how long it should keep a
point in time around.
```
POST /my_index/_pit?keep_alive=1m
```
The response from the above request includes a `id`, which should be
passed to the `id` of the `pit` parameter of search requests.
```
POST /_search
{
"query": {
"match" : {
"title" : "elasticsearch"
}
},
"pit": {
"id": "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWICBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA==",
"keep_alive": "1m"
}
}
```
Point-in-times are automatically closed when the `keep_alive` is
elapsed. However, keeping point-in-times has a cost; hence,
point-in-times should be closed as soon as they are no longer used in
search requests.
```
DELETE /_pit
{
"id" : "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWIBBXV1aWQyAAA="
}
```
#### Notable works in this change:
- Move the search state to the coordinating node: #52741
- Allow searches with a specific reader context: #53989
- Add the ability to acquire readers in IndexShard: #54966
Relates #46523
Relates #26472
Co-authored-by: Jim Ferenczi <jimczi@apache.org>
This commit removes `integTest` task from all es-plugins.
Most relevant projects have been converted to use yamlRestTest, javaRestTest,
or internalClusterTest in prior PRs.
A few projects needed to be adjusted to allow complete removal of this task
* x-pack/plugin - converted to use yamlRestTest and javaRestTest
* plugins/repository-hdfs - kept the integTest task, but use `rest-test` plugin to define the task
* qa/die-with-dignity - convert to javaRestTest
* x-pack/qa/security-example-spi-extension - convert to javaRestTest
* multiple projects - remove the integTest.enabled = false (yay!)
related: #61802
related: #60630
related: #59444
related: #59089
related: #56841
related: #59939
related: #55896
This also adds the ability to define a serialization check on Parameters, used
in this case to only serialize format and locale parameters if the mapper is a
date range.
There are currently half a dozen ways to add plugins and modules for
test clusters to use. All of them require the calling project to peek
into the plugin or module they want to use to grab its bundlePlugin
task, and then both depend on that task, as well as extract the archive
path the task will produce. This creates cross project dependencies that
are difficult to detect, and if the dependent plugin/module has not yet
been configured, the build will fail because the task does not yet
exist.
This commit makes the plugin and module methods for testclusters
symmetetric, and simply adding a file provider directly, or a project
path that will produce the plugin/module zip. Internally this new
variant uses normal configuration/dependencies across projects to get
the zip artifact. It also has the added benefit of no longer needing the
caller to add to the test task a dependsOn for bundlePlugin task.
FetchSubPhase has two 'execute' methods, one which takes all hits to be examined,
and one which takes a single HitContext. It's not obvious which one should be implemented
by a given sub-phase, or if implementing both is a possibility; nor is it obvious that we first
run the hitExecute methods of all subphases, and then subsequently call all the
hitsExecute methods.
This commit reworks FetchSubPhase to replace these two variants with a processor class,
`FetchSubPhaseProcessor`, that is returned from a single `getProcessor` method. This
processor class has two methods, `setNextReader()` and `process`. FetchPhase collects
processors from all its subphases (if a subphase does not need to execute on the current
search context, it can return `null` from `getProcessor`). It then sorts its hits by docid, and
groups them by lucene leaf reader. For each reader group, it calls `setNextReader()` on
all non-null processors, and then passes each doc id to `process()`.
Implementations of fetch sub phases can divide their concerns into per-request, per-reader
and per-document sections, and no longer need to worry about sorting docs or dealing with
reader slices.
FetchSubPhase now provides a FetchSubPhaseExecutor that exposes two methods,
setNextReader(LeafReaderContext) and execute(HitContext). The parent FetchPhase collects all
these executors together (if a phase should not be executed, then it returns null here); then
it sorts hits, and groups them by reader; for each reader it calls setNextReader, and then
execute for each hit in turn. Individual sub phases no longer need to concern themselves with
sorting docs or keeping track of readers; global structures can be built in
getExecutor(SearchContext), per-reader structures in setNextReader and per-doc in execute.
This commit adds a test to MapperTestCase that explicitly checks that a mapper can
serialize all its default values, and that this serialization can then be re-parsed. Note that
the test is disabled for non-parametrized mappers as their serialization may in some cases
output parameters that are not accepted. Gradually moving all mappers to parametrized
form will address this.
The commit also contains a fix to keyword mappers, which were not correctly serializing
the similarity parameter; this partially addresses #61563. It also enables `null` as a
value for `null_value` on `scaled_float`, as a follow-up to #61798
We frequently use `long`s with `BitArray` in aggs and right now we have
to assert that the `long` fits in an `int`. This adds support for `long`
to `BitArray` so we don't need those assertions.
This commit enhances the verbose output for the
`_ingest/pipeline/_simulate?verbose` api. Specifically
this adds the following:
* the pipeline processor is now included in the output
* the conditional (if) and result is now included in the output iff it was defined
* a status field is always displayed. the possible values of status are
* `success` - if the processor ran with out errors
* `error` - if the processor ran but threw an error that was not ingored
* `error_ignored` - if the processor ran but threw an error that was ingored
* `skipped` - if the process did not run (currently only possible if the if condition evaluates to false)
* `dropped` - if the the `drop` processor ran and dropped the document
* a `processor_type` field for the type of processor (e.g. set, rename, etc.)
* throw a better error if trying to simulate with a pipeline that does not exist
closes#56004
Runtime fields need to have a SearchLookup available, when building their fielddata implementations, so that they can look up other fields, runtime or not.
To achieve that, we add a Supplier<SearchLookup> argument to the existing MappedFieldType#fielddataBuilder method.
As we introduce the ability to look up other fields while building fielddata for mapped fields, we implicitly add the ability for a field to require other fields. This requires some protection mechanism that detects dependency cycles to prevent stack overflow errors.
With this commit we also introduce detection for cycles, as well as a limit on the depth of the references for a runtime field. Note that we also plan on introducing cycles detection at compile time, so the runtime cycles detection is a last resort to prevent stack overflow errors but we hope that we can reject runtime fields from being registered in the mappings when they create a cycle in their definition.
Note that this commit does not introduce any production implementation of runtime fields, but is rather a pre-requisite to merge the runtime fields feature branch.
This is a breaking change for MapperPlugins that plug in a mapper, as the signature of MappedFieldType#fielddataBuilder changes from taking a single argument (the index name), to also accept a Supplier<SearchLookup>.
Relates to #59332
Co-authored-by: Nik Everett <nik9000@gmail.com>
This commit removes the tasks module that only existed to define the
tasks result index, `.tasks`, as a system index. The definition for
the tasks results system index descriptor is moved to the
`SystemIndices` class with a check that no other plugin or module
attempts to define an entry with the same source.
Additionally, this change also makes the pattern for the tasks result
index a wildcard pattern since we will need this when the index is
upgraded (reindex to new name and then alias that to .tasks).
Backport of #61540
DeprecationLogger's constructor should not create two loggers. It was
taking parent logger instance, changing its name with a .deprecation
prefix and creating a new logger.
Most of the time parent logger was not needed. It was causing Log4j to
unnecessarily cache the unused parent logger instance.
depends on #61515
backports #58435
Splitting DeprecationLogger into two. HeaderWarningLogger - responsible for adding a response warning headers and ThrottlingLogger - responsible for limiting the duplicated log entries for the same key (previously deprecateAndMaybeLog).
Introducing A ThrottlingAndHeaderWarningLogger which is a base for other common logging usages where both response warning header and logging throttling was needed.
relates #55699
relates #52369
backports #55941
Before when a value was copied to a field through a parent field or `copy_to`,
we parsed it using the `FieldMapper` from the source field. Instead we should
parse it using the target `FieldMapper`. This ensures that we apply the
appropriate mapping type and options to the copied value.
To implement the fix cleanly, this PR refactors the value parsing strategy. Now
instead of looking up values directly, field mappers produce a helper object
`ValueFetcher`. The value fetchers are responsible for almost all aspects of
fetching, including looking up the right paths in the _source.
The PR is fairly big but each commit can be reviewed individually.
Fixes#61033.
In addition, this commit converts ScaledFloatFieldMapper as it was relying
on a number of static values taken from NumberFieldMapper that had changed
or been removed.
This switches a few tests for field mappers from `ESSingleNodeTestCase`
to `ESTestCase` because, in general, we prefer to avoid
`ESSingleNodeTestCase` when we can because it is slow and "big". "Big"
here means that it pulls in an entire node, making it difficult to
reason about what you are testing.
This makes KeywordFieldMapper extend ParametrizedFieldMapper, with explicitly
defined parameters.
In addition, we add a new option to Parameter, restrictedStringParam, which
accepts a restricted set of string options.
This commit removes the ability to test the top level result of an aggregator
before it runs the final reduce. All aggregator tests that use AggregatorTestCase#search
are rewritten with AggregatorTestCase#searchAndReduce in order to ensure that we test
the final output (the one sent to the end user) rather than an intermediary result
that could be different.
This change also removes spurious commits triggered on top of a random index writer.
These commits slow down the tests and are redundant with the commits that the
random index writer performs.
I was unable to reproduce this locally on either 7.6 (first introduced) and 7.x. This is already not muted
on master and doesn't appear to have failures. There were some API changes at the time that could
have affected this test, and I'm wondering with backports if this is now stable again. If this has more
failures, I will continue to investigate further.
Relates to #51939
This commit uses the new location for the reindex java-api documentation.
Temporary files have been left behind to pacify the docs build.
related #60339
Currently, validation of mappers (checking that cross-references are correct, limits on
field name lengths and object depths, multiple definitions, etc) is performed by the
MapperService. This means that any mapper-specific validation, for example that done
on the CompletionFieldMapper, needs to be called specifically from core server code,
and so we can't add validation to mappers that live in plugins.
This commit reworks the validation framework so that mapper-specific validation is
done on the Mapper itself. Mapper gets a new `validate(MappingLookup)`
method (already present on `MetadataFieldMapper` and now pulled up to the parent
interface), which is called from a new `DocumentMapper.validate()` method. All
the validation code currently living on `MapperService` moves either to individual
mapper implementations (FieldAliasMapper, CompletionFieldMapper) or into
`MappingLookup`, an altered `DocumentFieldMappers` which now knows about
object fields and can check for duplicate definitions, or into DocumentMapper
which handles soft limit checks.
* Merge test runner task into RestIntegTest (#60261)
* Merge test runner task into RestIntegTest
* Reorganizing Standalone runner and RestIntegTest task
* Rework general test task configuration and extension
* Fix merge issues
* use former 7.x common test configuration
Previously if an inner_hits block required _ source, we would reload and parse
the root document's source for every hit. This PR adds a shared SourceLookup to
the inner hits context that allows inner hits to reuse parsed source if it's
already available. This matches our approach for sharing the root document ID.
Relates to #32818.
- Replace immediate task creations by using task avoidance api
- One step closer to #56610
- Still many tasks are created during configuration phase. Tackled in separate steps
The `SourceLookup` class provides access to the _source for a particular
document, specified through `SourceLookup#setSegmentAndDocument`. Previously
the search context contained a single `SourceLookup` that was shared between
different fetch subphases. It was hard to reason about its state: is
`SourceLookup` set to the expected document? Is the _source already loaded and
available?
Instead of using a global source lookup, the fetch hit context now provides
access to a lookup that is set to load from the hit document.
This refactor closes#31000, since the same `SourceLookup` is no longer shared
between the 'fetch _source phase' and script execution.
In #60297 we added some tests related to logging from the transport
layer, but these tests failed occasionally since the cluster
was kept alive between test invocations but the logging framework
expected it only to be used for a single test. With this commit we
reduce the scope of the internal test cluster to `TEST` to solve this
problem.
Closes#60321.
This feature adds a new `fields` parameter to the search request, which
consults both the document `_source` and the mappings to fetch fields in a
consistent way. The PR merges the `field-retrieval` feature branch.
Addresses #49028 and #55363.
Transport connections between nodes remain in place until one or other
node shuts down or the connection is disrupted by a flaky network.
Today it is very difficult to demonstrate that transient failures and
cluster instability are caused by the network even though this is often
the case. In particular, transport connections open and close without
logging anything, even at `DEBUG` level, making it very hard to quantify
the scale of the problem or to correlate the networking problems with
external events.
This commit adds the missing `DEBUG`-level logging when transport
connections open and close, and also tracks the total number of
transport connections a node has opened as a measure of the stability of
the underlying network.
Introduce a javaRestTest source set and task to compliment the yamlRestTest.
javaRestTest differs such that the code is sourced from Java and may have
different dependencies and setup requirements for the test clusters. This also
allows the tests to run in parallel in different cluster instances to prevent any
cross test contamination between the two types of tests.
Included in this PR is all :modules no longer use the integTest task. The tests
are now driven by test, yamlRestTest, javaRestTest, and internalClusterTest.
Since only :modules (and :rest-api-spec) have been converted to yamlRestTest
we can now disable the integTest task if either yamlRestTest or javaRestTest have
been applied. Once all projects are converted, we can delete the integTest task.
related: #56841
related: #59444
keepalives tell any intermediate devices that the connection remains alive, which helps with overzealous firewalls that are
killing idle connections. keepalives are enabled by default in Elasticsearch, but use system defaults for their
configuration, which often times do not have reasonable defaults (e.g. 7200s for TCP_KEEP_IDLE) in the context of
distributed systems such as Elasticsearch.
This PR sets the socket-level keep_alive options for network.tcp.{keep_idle,keep_interval} to 5 minutes on configurations
that support it (>= Java 11 & (MacOS || Linux)) and where the system defaults are set to something higher than 5
minutes. This helps keep the connections alive while not interfering with system defaults or user-specified settings
unless they are deemed to be set too high by providing better out-of-the-box defaults.
For ingest node processors a per processor description
was recently added. This commit displays that description
in the verbose output of the pipeline simulation.
related #57906
This commit allows customizing the word delimiter token filters to skip processing
tokens tagged as keyword through the `ignore_keywords` flag Lucene's
WordDelimiterGraphFilter already exposes.
Fix for #59491
We never used the `IndexSettings` parameter and we only used the
`MappedFieldType` parameter to get the name of the field which we
already know everywhere where we build the `IFD.Builder`. This allows us
to drop a fair bit of ceremony from a couple of tests.
Follow up to #59606 using some of the new infrastructure and making similar cleanups (and due to at times better handling of size hints and empty collections also optimizations in the stream utility methods this also means speedups) in various spots in the core codebase.
* Adding new `require_alias` option to indexing requests (#58917)
This commit adds the `require_alias` flag to requests that create new documents.
This flag, when `true` prevents the request from automatically creating an index. Instead, the destination of the request MUST be an alias.
When the flag is not set, or `false`, the behavior defaults to the `action.auto_create_index` settings.
This is useful when an alias is required instead of a concrete index.
closes https://github.com/elastic/elasticsearch/issues/55267
Only available in the ingest context for use in ingest pipelines.
Digests are computed on the UTF-8 encoding of the String and are
returned as hex strings.
sha1() return hex strings of length 40, sha256() returns length 64
Fixes: #59647
Backport: 3c85272
* Add doc runtime class path
* Use getAllHttpSocketURI.get(0) instead of getAllHttpSocketURI to get a single
test cluster URL rather than a list
Backport: 3057e0f
The update by query action parses a script from an object (map or string). We will need to do the same for runtime fields as they are parsed as part of mappings (#59391).
This commit moves the existing parsing of a script from an object from RestUpdateByQueryAction to the Script class. It also adds tests and adjusts some error messages that are incorrect. Also, options were not parsed before and they are now. And unsupported fields trigger now a deprecation warning.
This commit moves the modules REST tests to the
newly introduced yamlRestTest source set. A few
tests have also been re-named to include the correct
IT suffix. Without changing the names, the testing
conventions task would fail since now that the YAML
tests are no longer present pacify the convention.
These tests have moved to the internalClusterTest
source set.
related: #56841
Backport of #59293 to 7.x branch.
* Create new data-stream xpack module.
* Move TimestampFieldMapper to the new module,
this results in storing a composable index template
with data stream definition only to work with default
distribution. This way data streams can only be used
with default distribution, since a data stream can
currently only be created if a matching composable index
template exists with a data stream definition.
* Renamed `_timestamp` meta field mapper
to `_data_stream_timestamp` meta field mapper.
* Add logic to put composable index template api
to fail if `_data_stream_timestamp` meta field mapper
isn't registered. So that a more understandable
error is returned when attempting to store a template
with data stream definition via the oss distribution.
In a follow up the data stream transport and
rest actions can be moved to the xpack data-stream module.
With the removal of mapping types and the immutability of FieldTypeLookup in #58162, we no longer
have any cause to compare MappedFieldType instances. This means that we can remove all equals
and hashCode implementations, and in addition we no longer need the clone implementations which
were required for equals/hashcode testing. This greatly simplifies implementing new MappedFieldTypes,
which will be particularly useful for the runtime fields project.
The FieldMapper infrastructure currently has a bunch of shared parameters, many of which
are only applicable to a subset of the 41 mapper implementations we ship with. Merging,
parsing and serialization of these parameters are spread around the class hierarchy, with
much repetitive boilerplate code required. It would be much easier to reason about these
things if we could declare the parameter set of each FieldMapper directly in the implementing
class, and share the parsing, merging and serialization logic instead.
This commit is a first effort at introducing a declarative parameter style. It adds a new FieldMapper
subclass, ParametrizedFieldMapper, and refactors two mappers, Boolean and Binary, to use it.
Parameters are declared on Builder classes, with the declaration including the parameter name,
whether or not it is updateable, a default value, how to parse it from mappings, and how to
extract it from another mapper at merge time. Builders have a getParameters method, which
returns a list of the declared parameters; this is then used for parsing, merging and serialization.
Merging is achieved by constructing a new Builder from the existing Mapper, and merging in
values from the merging Mapper; conflicts are all caught at this point, and if none exist then a new,
merged, Mapper can be built from the Builder. This allows all values on the Mapper to be final.
Other mappers can be gradually migrated to this new style, and once they have all been refactored
we can merge ParametrizedFieldMapper and FieldMapper entirely.
Backport of #59076 to 7.x branch.
The commit makes the following changes:
* The timestamp field of a data stream definition in a composable
index template can only be set to '@timestamp'.
* Removed custom data stream timestamp field validation and reuse the validation from `TimestampFieldMapper` and
instead only check that the _timestamp field mapping has been defined on a backing index of a data stream.
* Moved code that injects _timestamp meta field mapping from `MetadataCreateIndexService#applyCreateIndexRequestWithV2Template58956(...)` method
to `MetadataIndexTemplateService#collectMappings(...)` method.
* Fixed a bug (#58956) that cases timestamp field validation to be performed
for each template and instead of the final mappings that is created.
* only apply _timestamp meta field if index is created as part of a data stream or data stream rollover,
this fixes a docs test, where a regular index creation matches (logs-*) with a template with a data stream definition.
Relates to #58642
Relates to #53100Closes#58956Closes#58583
This makes a `parentCardinality` available to every `Aggregator`'s ctor
so it can make intelligent choices about how it collects bucket values.
This replaces `collectsFromSingleBucket` and is similar to it but:
1. It supports `NONE`, `ONE`, and `MANY` values and is generally
extensible if we decide we can use more precise counts.
2. It is more accurate. `collectsFromSingleBucket` assumed that all
sub-aggregations live under multi-bucket aggregations. This is
normally true but `parentCardinality` is properly carried forward
for single bucket aggregations like `filter` and for multi-bucket
aggregations configured in single-bucket for like `range` with a
single range.
While I was touching every aggregation I renamed `doCreateInternal` to
`createMapped` because that seemed like a much better name and it was
right there, next to the change I was already making.
Relates to #56487
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
In order to ensure that we do not write a broken piece of `RepositoryData`
because the phyiscal repository generation was moved ahead more than one step
by erroneous concurrent writing to a repository we must check whether or not
the current assumed repository generation exists in the repository physically.
Without this check we run the risk of writing on top of stale cached repository data.
Relates #56911
This commit creates a new Gradle plugin to provide a separate task name
and source set for running YAML based REST tests. The only project
converted to use the new plugin in this PR is distribution/archives/integ-test-zip.
For which the testing has been moved to :rest-api-spec since it makes the most
sense and it avoids a small but awkward change to the distribution plugin.
The remaining cases in modules, plugins, and x-pack will be handled in followups.
This plugin is distinctly different from the plugin introduced in #55896 since
the YAML REST tests are intended to be black box tests over HTTP. As such they
should not (by default) have access to the classpath for that which they are testing.
The YAML based REST tests will be moved to separate source sets (yamlRestTest).
The which source is the target for the test resources is dependent on if this
new plugin is applied. If it is not applied, it will default to the test source
set.
Further, this introduces a breaking change for plugin developers that
use the YAML testing framework. They will now need to either use the new source set
and matching task, or configure the rest resources to use the old "test" source set that
matches the old integTest task. (The former should be preferred).
As part of this change (which is also breaking for plugin developers) the
rest resources plugin has been removed from the build plugin and now requires
either explicit application or application via the new YAML REST test plugin.
Plugin developers should be able to fix the breaking changes to the YAML tests
by adding apply plugin: 'elasticsearch.yaml-rest-test' and moving the YAML tests
under a yamlRestTest folder (instead of test)
Backport of #58582 to 7.x branch.
This commit adds a new metadata field mapper that validates,
that a document has exactly a single timestamp value in the data stream timestamp field and
that the timestamp field mapping only has `type`, `meta` or `format` attributes configured.
Other attributes can affect the guarantee that an index with this meta field mapper has a
useable timestamp field.
The MetadataCreateIndexService inserts a data stream timestamp field mapper whenever
a new backing index of a data stream is created.
Relates to #53100
Currently we are leaving the settings to default port range in the nio
and netty4 http server test. This has recently led to tests failing due
to what appears to be a port conflict with other processes. This commit
modifies these tests to use the test case helper method to generate port
ranges.
Fixes#58433 and #58296.
The refactoring in #57666 inadvertently enabled norms on two of the percolator subfields,
leading to an increase in memory usage. This commit disables norms on these fields again.
Date processor was incorrectly parsing week based dates because when a
weekbased year was provided ingest module was thinking year was not
on a date and was trying to applying the logic for dd/MM type of
dates.
Date Processor is also allowing users to specify locale parameter. It
should be taken into account when parsing dates - currently only used
for formatting. If someone specifies 'en-us' locale, then calendar data
rules for that locale should be used.
The exception is iso8601 format. If someone is using that format,
then locale should not override calendar data rules.
closes#58479
Restoring from a snapshot (which is a particular form of recovery) does not currently take recovery throttling into account
(i.e. the `indices.recovery.max_bytes_per_sec` setting). While restores are subject to their own throttling (repository
setting `max_restore_bytes_per_sec`), this repository setting does not allow for values to be configured differently on a
per-node basis. As restores are very similar in nature to peer recoveries (streaming bytes to the node), it makes sense to
configure throttling in a single place.
The `max_restore_bytes_per_sec` setting is also changed to default to unlimited now, whereas previously it was set to
`40mb`, which is the current default of `indices.recovery.max_bytes_per_sec`). This means that no behavioral change
will be observed by clusters where the recovery and restore settings were not adapted.
Relates https://github.com/elastic/elasticsearch/issues/57023
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
* Replace compile configuration usage with api (#58451)
- Use java-library instead of plugin to allow api configuration usage
- Remove explicit references to runtime configurations in dependency declarations
- Make test runtime classpath input for testing convention
- required as java library will by default not have build jar file
- jar file is now explicit input of the task and gradle will ensure its properly build
* Fix compile usages in 7.x branch
Rather than let ExtensiblePlugins know extending plugins' classloaders,
we now pass along an explicit ExtensionLoader that loads the extensions
asked for. Extensions constructed that way can optionally receive their
own Plugin instance in the constructor.
Today we have individual settings for configuring node roles such as
node.data and node.master. Additionally, roles are pluggable and we have
used this to introduce roles such as node.ml and node.voting_only. As
the number of roles is growing, managing these becomes harder for the
user. For example, to create a master-only node, today a user has to
configure:
- node.data: false
- node.ingest: false
- node.remote_cluster_client: false
- node.ml: false
at a minimum if they are relying on defaults, but also add:
- node.master: true
- node.transform: false
- node.voting_only: false
If they want to be explicit. This is also challenging in cases where a
user wants to have configure a coordinating-only node which requires
disabling all roles, a list which we are adding to, requiring the user
to keep checking whether a node has acquired any of these roles.
This commit addresses this by adding a list setting node.roles for which
a user has explicit control over the list of roles that a node has. If
the setting is configured, the node has exactly the roles in the list,
and not any additional roles. This means to configure a master-only
node, the setting is merely 'node.roles: [master]', and to configure a
coordinating-only node, the setting is merely: 'node.roles: []'.
With this change we deprecate the existing 'node.*' settings such as
'node.data'.
Netty4HttpServerTransportTests has started to fail intermittently. It
seems like unexpected successful responses are being received when the
test is simulating errors. This commit adds logging to the test to
provide additional information when there is an unexpected success. It
also adds the logging to the nio http test.
Similarities only apply to a few text-based field types, but are currently set directly on
the base MappedFieldType class. This commit moves similarity information into
TextSearchInfo, and removes any mentions of it from MappedFieldType or FieldMapper.
It was previously possible to include a similarity parameter on a number of field types
that would then ignore this information. To make it obvious that this has no effect, setting
this parameter on non-text field types now issues a deprecation warning.
Now that MappedFieldType no longer extends lucene's FieldType, we need to have a
way of getting the index information about a field necessary for building text queries,
building term vectors, highlighting, etc. This commit introduces a new TextSearchInfo
abstraction that holds this information, and a getTextSearchInfo() method to
MappedFieldType to make it available. Field types that do not support text search can
just return null here.
This allows us to remove the MapperService.getLuceneFieldType() shim method.
Fixes a bug in TextFieldMapper serialization when index is false, and adds a
base-class test to ensure that all field mappers are tested against all variations
with defaults both included and excluded.
Fixes#58188
This is currently used to set the indexVersionCreated parameter on FieldMapper.
However, this parameter is only actually used by two implementations, and clutters
the API considerably. We should just remove it, and use it directly in the
implementations that require it.
MappedFieldType is a combination of two concerns:
* an extension of lucene's FieldType, defining how a field should be indexed
* a set of query factory methods, defining how a field should be searched
We want to break these two concerns apart. This commit is a first step to doing this, breaking
the inheritance relationship between MappedFieldType and FieldType. MappedFieldType
instead has a series of boolean flags defining whether or not the field is searchable or
aggregatable, and FieldMapper has a separate FieldType passed to its constructor defining
how indexing should be done.
Relates to #56814
This commit adds an optional field, `description`, to all ingest processors
so that users can explain the purpose of the specific processor instance.
Closes#56000.
This commit introduces an optimization for inline scripts.
It keeps the compiled ingest script that the ScriptProcessor.Factory
has been creating for validation purposes. Previously, the Script Service's
cache was leveraged because it was the best way to handle caching of both
stored and inline scripts. Since inline scripts are so widely used in
Ingest Node, it is probably best to ensure we are using the pre-compiled version
from the beginning.
* Remove usage of deprecated testCompile configuration
* Replace testCompile usage by testImplementation
* Make testImplementation non transitive by default (as we did for testCompile)
* Update CONTRIBUTING about using testImplementation for test dependencies
* Fail on testCompile configuration usage
Backport of #57870 to 7.x branch.
This change now also copies the op_type from the reindex request's destination index request to the actual index request being used in the bulk request.
For ensuring no document exists, the op_type create doesn't need to be copied, since Versions.MATCH_DELETED will copied from the 'mainRequest.getDestination().version()'.
The `version()` method on IndexRequest only returns Versions.MATCH_DELETED if op_type=create and no specific version has been specified.
However in order to be able to index into a data stream, the op_type must be create. So in order to support that the op_type must be copied from the reindex request's destination index request to the actual index request being used in the bulk request.
Relates to #53100 and #57788
Reworks the `parent` and `child` aggregation are not at the top level
using the optimization from #55873. Instead of wrapping all
non-top-level `parent` and `child` aggregators we now handle being a
child aggregator in the aggregator, specifically by adding recording
which global ordinals show up in the parent and then checking if they
match the child.
Adds assertions to Netty to make sure that its threads are not polluted by thread contexts (and
also that thread contexts are not leaked). Moves the ClusterApplierService to use the system
context (same as we do for MasterService), which allows to remove a hack from
TemplateUgradeService and makes it clearer that applying CS updates is fully executing under
system context.
When Joni, the regex engine that powers grok emits a warning it
does so by default to System.err. System.err logs are all bucketed
together in the server log at WARN level. When Joni emits a warning,
it can be extremely verbose, logging a message for each execution
again that pattern. For ingest node that means for every document
that is run that through Grok. Fortunately, Joni provides a call
back hook to push these warnings to a custom location.
This commit implements Joni's callback hook to push the Joni warning
to the Elasticsearch server logger (logger.org.elasticsearch.ingest.common.GrokProcessor)
at debug level. Generally these warning indicate a possible issue with
the regular expression and upon creation of the Grok processor will
do a "test run" of the expression and log the result (if any) at WARN
level. This WARN level log should only occur on pipeline creation which
is a much lower frequency then every document.
Additionally, the documentation is updated with instructions for how
to set the logger to debug level.
Allow for optimistic concurrency control during ingest by checking the
sequence number and primary term. This is accomplished by defining
_if_seq_no and _if_primary_term in the pipeline, similarly to _version
and _version_type.
Closes#41255
Co-authored-by: Maria Ralli <mariai.ralli@gmail.com>
Adds assertions to Netty to make sure that its threads are not polluted by thread contexts (and
also that thread contexts are not leaked). Moves the ClusterApplierService to use the system
context (same as we do for MasterService), which allows to remove a hack from
TemplateUgradeService and makes it clearer that applying CS updates is fully executing under
system context.
Before to determine if a field is meta-field, a static method of MapperService
isMetadataField was used. This method was using an outdated static list
of meta-fields.
This PR instead changes this method to the instance method that
is also aware of meta-fields in all registered plugins.
Related #38373, #41656Closes#24422
Follow up to #56961:
We can be a little more efficient than just serializing at the IO loop by serializing
only when we flush to a channel. This has the advantage that we don't serialize a long
queue of messages for a channel that isn't writable for a longer period of time (unstable network,
actually writing large volumes of data, etc.).
Also, this further reduces the time for which we hold on to the write buffer for a message,
making allocations because of an empty page cache recycler pool less likely.
Pulls the way that the `ParentJoinAggregator` collects global ordinals
into a strategy object so it is a little simpler to reason about and
it'll be simpler to save memory by removing `asMultiBucketAggregator` in
the future.
Relates to #56487
Almost every outbound message is serialized to buffers of 16k pagesize.
We were serializing these messages off the IO loop (and retaining the concrete message
instance as well) and would then enqueue it on the IO loop to be dealt with as soon as the
channel is ready.
1. This would cause buffers to be held onto for longer than necessary, causing less reuse on average.
2. If a channel was slow for some reason, not only would concrete message instances queue up for it, but also 16k of buffers would be reserved for each message until it would be written+flushed physically.
With this change, the serialization happens on the event loop which effectively limits the number of buffers that `N` IO-threads will ever use so long as messages are small and channels writable.
Also, this change dereferences the reference to the concrete outbound message as soon as it has been serialized to save some more on GC.
This reduces the GC time for a default PMC run by about 50% in experiments (3 nodes, 2G heap each, loopback ... obvious caveat is that GC isn't that heavy in the first place with recent changes but still a measurable gain).
I also expect it to be helpful for master node stability by causing less of a spike if master is e.g. hit by a large number of requests that are processed batched (e.g. shard snapshot status updates) and responded to in a short time frame all at once.
Obviously, the downside to this change is that it introduces more latency on the IO loop for the serialization. But since we read all of these messages on the IO loop as well I don't see it as much of a qualitative change really and the more predictable buffer use seems much more valuable relatively.
Previously we'd get a `ClassCastException` when you tried to use
`numeric_type` on `scaled_float`. Oops! This cleans up the CCE and moves
some code around so the casting actually works.
This commit adds support for rules with multiple tokens on LHS, also
known as "contraction rules", into stemmer override token
filter. Contraction rules are handy into translating multiple
inflected words into the same root form. One side effect of this change is
that it brings stemmer override rules format closer to synonym rules
format so that it makes it easier to translate one into another.
This change also makes stemmer override rules parser more strict so
that it should catch more errors which were previously accepted.
Closes#56113
When the parameter `max_docs` is less than `slices` in update_by_query,
delete_by_query or reindex API, `max_docs ` is set to 0 and we throw an
action_request_validation_exception with confused error message:
"maxDocs should be greater than 0...".
This change checks that whether `max_docs` is less than `slices` and
throw an illegal_argument_exception with clear message.
Relates to #52786.
Co-authored-by: bellengao <gbl_long@163.com>
When we had multiple mapping types, an update to a field in one type had to be
propagated to the same field in all other types. This was done using the
Mapper.updateFieldType() method, called at the end of a merge. However, now
that we only have a single type per index, this method is unnecessary and can
be removed.
Relates to #41059
Backport of #56986
We don't need to hold on to the request body past the beginning of sending
the response. There is no need to keep a reference to it until after the response
has been sent fully and we can eagerly release it here.
Note, this can be optimized further to release the contents even earlier but for now
this is an easy increment to saving some memory on the IO pool.
* Update DeprecationMap to DynamicMap (#56149)
This renames DeprecationMap to DynamicMap, and changes the deprecation
messages Map to accept a Map of String (keys) to Functions (updated values)
instead. This creates more flexibility in either logging or updating values from
params within a script. This change is required to fix (#52103) in a future PR.
* Fix Source Return Bug in Scripting (#56831)
This change ensures that when a user returns _source directly no matter where
accessed within scripting, the value is a Map of the converted source as
opposed to a SourceLookup.
Merging logic is currently split between FieldMapper, with its merge() method, and
MappedFieldType, which checks for merging compatibility. The compatibility checks
are called from a third class, MappingMergeValidator. This makes it difficult to reason
about what is or is not compatible in updates, and even what is in fact updateable - we
have a number of tests that check compatibility on changes in mapping configuration
that are not in fact possible.
This commit refactors the compatibility logic so that it all sits on FieldMapper, and
makes it called at merge time. It adds a new FieldMapperTestCase base class that
FieldMapper tests can extend, and moves the compatibility testing machinery from
FieldTypeTestCase to here.
Relates to #56814
Elasticsearch requires that a HttpRequest abstraction be implemented
by http modules before server processing. This abstraction controls when
underlying resources are released. This commit moves this abstraction to
be created immediately after content aggregation. This change will
enable follow-up work including moving Cors logic into the server
package and tracking bytes as they are aggregated from the network
level.
In most cases we are seeing a `PooledHeapByteBuf` here now. No need to
redundantly create an new `ByteBuffer` and single element array for it
here when we can just directly unwrap its internal `byte[]`.
Mapper.Builder currently has some complex generics on it to allow fluent builder
construction. However, the second parameter, a return type from the build() method,
is unnecessary, as we can use covariant return types. This commit removes this second
generic parameter.
This is another part of the breakup of the massive BuildPlugin. This PR
moves the code for configuring publications to a separate plugin. Most
of the time these publications are jar files, but this also supports the
zip publication we have for integ tests.
We never do any file IO or other blocking work on the transport threads
so no tangible benefit can be derived from using more threads than CPUs
for IO.
There are however significant downsides to using more threads than necessary
with Netty in particular. Since we use the default setting for
`io.netty.allocator.useCacheForAllThreads` which is `true` we end up
using up to `16MB` of thread local buffer cache for each transport thread.
Meaning we potentially waste CPUs * 16MB of heap for unnecessary IO threads in addition to obvious inefficiencies of artificially adding extra context switches.
This PR proposes to use `IndexSortSortedNumericDocValuesRangeQuery` when
possible to speed up certain range queries. Points-based queries are already
very efficient, the only time this query makes a difference is when the range
matches a large number of documents.
Relates to #48665.
If a conditional is added to a processor, and that processor fails, and
that processor has an on_failure handler, the full trace of all of the
executed processors may not be displayed in simulate verbose. The
information is correct, but misses displaying some of the steps used
to get there.
This happens because a processor that is conditional processor is a
wrapper around the real processor and a processor with an on_failure
handler is also a wrapper around the processor(s). When decorating for
simulation we treat compound processor specially, but if a compound processor
is wrapped by a conditional processor that compound processor's processors
can be missed for decoration resulting in the missing displayed steps.
The fix to this is to treat the conditional processor specially and
explicitly seperate it from the processor it is wrapping. This requires
us to keep track of 2 processors a possible conditional processor and
the actual processor it may be wrapping.
related: #56004
Currently Elasticsearch creates independent event loop groups for each
transport (http and internal) transport type. This is unnecessary and
can lead to contention when different threads access shared resources
(ex: allocators). This commit moves to a model where, by default, the
event loops are shared between the transports. The previous behavior can
be attained by specifically setting the http worker count.
Right now all implementations of the `terms` agg allocate a new
`Aggregator` per bucket. This uses a bunch of memory. Exactly how much
isn't clear but each `Aggregator` ends up making its own objects to read
doc values which have non-trivial buffers. And it forces all of it
sub-aggregations to do the same. We allocate a new `Aggregator` per
bucket for two reasons:
1. We didn't have an appropriate data structure to track the
sub-ordinals of each parent bucket.
2. You can only make a single call to `runDeferredCollections(long...)`
per `Aggregator` which was the only way to delay collection of
sub-aggregations.
This change switches the method that builds aggregation results from
building them one at a time to building all of the results for the
entire aggregator at the same time.
It also adds a fairly simplistic data structure to track the sub-ordinals
for `long`-keyed buckets.
It uses both of those to power numeric `terms` aggregations and removes
the per-bucket allocation of their `Aggregator`. This fairly
substantially reduces memory consumption of numeric `terms` aggregations
that are not the "top level", especially when those aggregations contain
many sub-aggregations. It also is a pretty big speed up, especially when
the aggregation is under a non-selective aggregation like
the `date_histogram`.
I picked numeric `terms` aggregations because those have the simplest
implementation. At least, I could kind of fit it in my head. And I
haven't fully understood the "bytes"-based terms aggregations, but I
imagine I'll be able to make similar optimizations to them in follow up
changes.
Another Jackson release is available. There are some CVEs addressed,
none of which impact us, but since we can now bump Jackson easily, let
us move along with the train to avoid the false positives from security
scanners.
`FieldMapper#parseCreateField` accepts the parse context, plus a list of fields
as an output parameter. These fields are immediately added to the document
through `ParseContext#doc()`.
This commit simplifies the signature by removing the list of fields, and having
the mappers add the fields directly to `ParseContext#doc()`. I think this is
nicer for implementors, because previously fields could be added either through
the list, or the context (through `add`, `addWithKey`, etc.)
Backport of #56034.
Move includeDataStream flag from an IndicesOptions to IndexNameExpressionResolver.Context
as a dedicated field that callers to IndexNameExpressionResolver can set.
Also alter indices stats api to support data streams.
The rollover api uses this api and otherwise rolling over data stream does no longer work.
Relates to #53100
* Emit deprecation warning if multiple v1 templates match with a new index (#55558)
* Emit deprecation warning if multiple v1 templates match with a new index
* DEPRECATION_LOGGER rename
* Fix empty_value handling in CsvProcessor
Due to bug in `CsvProcessor.Factory` it was impossible to specify `empty_value`.
This change fixes that and adds relevant test.
Closes#55643
* assert changed
The Lucene `preserve_original` setting is currently not supported in the `edge_ngram`
token filter. This change adds it with a default value of `false`.
Closes#55767
Currently there is a clear mechanism to stub sending a request through
the transport. However, this is limited to testing exceptions on the
sender side. This commit reworks our transport related testing
infrastructure to allow stubbing request handling on the receiving side.
* Simplify java home verification
At one time, all uses of java home were found through the getJavaHome
utility method on BuildPlugin. However, that was changed many
refactorings ago, but the complex support for registering a java home
version needed that fails at configuration time still exists. The only
remaining use of grabbing java home is within bwc tests, and must be at
runtime since that is when we have the checkout and know what version is
needed.
This commit consolidates the java home finding method into a utility
unassociated with BuildPlugin.
* fix checkstyle
* address feedback