Reindex-from-remote had a race when it tried to clear the scroll. It
first starts the request to clear the scroll and then submits a task
to the generic threadpool to shut down the client. These two things
race and, in my experience, closing the scroll generally loses. That
means that most of the time reindex-from-remote isn't clearing the
scrolls that it uses. This isn't the end of the world because we
flush old scroll contexts after a while, but it isn't great.
Noticed while experimenting with #22514.
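A minimal sketch of the ordering fix, with hypothetical names (`clearScroll`, `threadPool`, `client`, `closeQuietly` all stand in for the real reindex-from-remote plumbing): shut the client down from the clear-scroll callback so the two operations can no longer race.

```java
// Sketch only, not the actual implementation: the client is torn down only after the
// clear-scroll call has completed (successfully or not).
void cleanupRemote(String scrollId) {
    clearScroll(scrollId, new ActionListener<Void>() {
        @Override
        public void onResponse(Void ignored) {
            // the scroll context is released, now it is safe to tear the client down
            threadPool.generic().execute(() -> closeQuietly(client));
        }

        @Override
        public void onFailure(Exception e) {
            logger.warn("failed to clear scroll [" + scrollId + "]", e);
            threadPool.generic().execute(() -> closeQuietly(client));
        }
    });
}
```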
This commit updates the cluster allocation explain API documentation to
explain the new request parameters and response formats, and gives
examples of the explain API responses under various scenarios.
Affix settings are useful to namespace a certain setting. Yet, affix settings
must be specialized for their concrete type, which causes a lot of code duplication.
This commit allows reusing an existing setting with an affix setting as soon as
a concrete key is available.
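As an illustration of the idea only (this is not the real `Setting` API; every name below is made up): an affix setting is essentially a factory keyed by a namespace that reuses an existing setting definition and builds the concrete setting once the full key is known.

```java
import java.util.function.Function;

// Toy sketch of an affix setting: prefix + <namespace> + suffix, delegating to an
// existing setting definition instead of re-declaring it per concrete type.
final class AffixSettingSketch<T> {
    private final String prefix;                                // e.g. "transport.profiles."
    private final String suffix;                                // e.g. ".port"
    private final Function<String, SettingSketch<T>> delegate;  // reused setting definition

    AffixSettingSketch(String prefix, String suffix, Function<String, SettingSketch<T>> delegate) {
        this.prefix = prefix;
        this.suffix = suffix;
        this.delegate = delegate;
    }

    // Build the concrete setting as soon as the concrete key (namespace) is available.
    SettingSketch<T> getConcreteSetting(String namespace) {
        return delegate.apply(prefix + namespace + suffix);
    }
}

final class SettingSketch<T> {
    final String key;
    final T defaultValue;

    SettingSketch(String key, T defaultValue) {
        this.key = key;
        this.defaultValue = defaultValue;
    }
}
```

For example, `new AffixSettingSketch<>("transport.profiles.", ".port", key -> new SettingSketch<>(key, 9300))` would produce the concrete setting `transport.profiles.client.port` for the namespace `client` without duplicating the port setting definition.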
Reindex-from-remote was accepting source filtering in the request
but ignoring it and setting `_source=true` on the search URI. This
fixes the filtering so it is piped through to the remote node and
adds tests for that.
Closes #22507
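Roughly, the shape of the fix is to translate the request's includes/excludes into the remote search URI parameters instead of always sending `_source=true`; the parameter names and the helper below are assumptions, not the exact implementation.

```java
import java.util.Map;

// Sketch only: forwards the request's source filtering to the remote search URI.
static void addSourceParams(Map<String, String> params, String[] includes, String[] excludes) {
    if (includes.length == 0 && excludes.length == 0) {
        params.put("_source", "true");                        // old behavior: fetch everything
        return;
    }
    if (includes.length > 0) {
        params.put("_source_include", String.join(",", includes));
    }
    if (excludes.length > 0) {
        params.put("_source_exclude", String.join(",", excludes));
    }
}
```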
One needs to close the higher-level objects (like UnicastZenPing) before closing the transport service. The latter can otherwise trip assertions w.r.t. open connections.
This adds methods to parse InternalSearchHit and InternalSearchHits from their
xContent representation. Most of the information in the original object is
preserved when rendering the object to xContent and then parsing it back.
However, some pieces of information are lost which we currently cannot parse
back from the rest response, most notably:
* the "match" property of the Lucene explanation is not rendered in the
"_explain" section and cannot be reconstructed on the client side
* the original "shard" information (SearchShardTarget) is only rendered if the
"explanation" is also set; we also lose the indexUUID of the contained
ShardId because we don't write it out. As a replacement we can use
ClusterState.UNKNOWN_UUID on the receiving side.
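The round-trip test pattern looks roughly like the following; the `fromXContent` entry point and the `createParser` helper are assumptions here, not the exact signatures added by this change.

```java
// Pseudo-test sketch: render a hit to JSON and parse it back, then compare only the pieces
// that are expected to survive the round trip (id, score, ...).
XContentBuilder builder = XContentFactory.jsonBuilder();
originalHit.toXContent(builder, ToXContent.EMPTY_PARAMS);
try (XContentParser parser = createParser(builder)) {
    InternalSearchHit parsedHit = InternalSearchHit.fromXContent(parser); // name assumed
    assertEquals(originalHit.getId(), parsedHit.getId());
    assertEquals(originalHit.getScore(), parsedHit.getScore(), 0.0f);
    // the "match" flag of the explanation and the index UUID of the shard are expected to differ
}
```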
The NodeConnectionsService currently determines which nodes to connect to / disconnect from by inspecting cluster state changes and connecting to added nodes / disconnecting from removed nodes. However, when a master steps down (for example because another master-eligible node shut down, bringing the number of master-eligible nodes below minimum_master_nodes) and the connection to other existing nodes was dropped while pinging, the connection to these nodes is not re-established while publishing the first cluster state that establishes the node as master.
This commit changes the NodeConnectionsService connect / disconnect logic to always rely on the state that is to be / was published, looking not only at the added / removed nodes, but validating that all nodes that are currently registered in NodeConnectionsService are connected (which is a no-op if a node is already connected).
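A simplified sketch of the reconciliation idea (method and field names are assumptions, not the actual class): for the state that is to be / was published, make sure every node in it has a connection, rather than only acting on the added/removed diff.

```java
// Sketch only: connectToNode is a no-op when a connection already exists, so walking all
// nodes of the published state is cheap and self-healing.
void connectToNodes(DiscoveryNodes nodesInPublishedState) {
    for (DiscoveryNode node : nodesInPublishedState) {
        try {
            transportService.connectToNode(node);
        } catch (Exception e) {
            logger.warn("failed to connect to " + node, e);
        }
    }
}
```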
This commit adds a document describing our data replication model in high level terms. The goal is to give people basic insight into how things work in order to better understand how reads and writes interact, both during normal operations and under failures.
The document in the randomized GetResult can exist with no source (for example if _source was disabled in the mappings), so the test should not always expect a non-null source when the document exists.
* Promote longs to doubles when a terms agg mixes decimal and non-decimal numbers
This change makes the terms aggregation work when the buckets coming from different indices are a mix of decimal numbers and non-decimal numbers. In this case non-decimal numbers (longs) are promoted to decimals (doubles), which can result in a loss of precision for big numbers.
Fixes #22232
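The precision caveat is the usual long-to-double promotion issue: a `double` has a 53-bit mantissa, so integral values above 2^53 may no longer be exactly representable after promotion.

```java
public class LongToDoublePromotion {
    public static void main(String[] args) {
        long exact = 1L << 53;          // 9007199254740992 is still exactly representable
        long tooBig = (1L << 53) + 1;   // 9007199254740993 is not
        System.out.println((long) (double) exact == exact);   // true
        System.out.println((long) (double) tooBig == tooBig); // false: rounded back to 2^53
    }
}
```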
There is a bug in the error message of the exception that is thrown if the number of docs differs between the source and target shards when recovering a shard with a syncId: the source and target doc counts are swapped around.
Closes #21893
Currently, such tasks are only created for the default boxes (centos-7, ubuntu-1404), not for all boxes, which can be misleading for developers who want to debug testing scripts on non-default boxes.
Removes `AggregatorParsers`, replacing all of its functionality with
`XContentParser#namedObject`.
This is the third bit of payoff from #22003, one less thing to pass
around the entire application.
If the remote didn't return a content type then reindex
tried to guess the content-type. This didn't work most
of the time and produced a rather useless error message.
Given that Elasticsearch always returns the content-type,
we are dropping content-type detection in favor of just
failing the request if the remote doesn't return a content-type.
Closes #22329
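A sketch of what "just failing" means here, using Apache HttpComponents types for illustration; the helper name and the exact error message are made up.

```java
import org.apache.http.Header;
import org.apache.http.HttpResponse;

// Sketch only: instead of sniffing the format from the body, fail the request with a clear
// message when the remote response carries no Content-Type header.
static String requireContentType(HttpResponse response) {
    Header header = response.getFirstHeader("Content-Type");
    if (header == null || header.getValue() == null || header.getValue().isEmpty()) {
        throw new IllegalStateException("remote response didn't include a Content-Type header; "
                + "unable to parse the response body");
    }
    return header.getValue();
}
```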
The test pinged and waited for the ping results to be returned, but since we first return the result and then close temporary connections, assertions that expect all connections to be closed by the end of the test are tripped.
Closes #22497
This commit checks for a null BytesReference as the value for `source`
in GetResult#sourceRef and simply returns null. Previously this would
have resulted in an NPE. While this does seem internal at first glance, it can affect
user code as a GetResponse could trigger this when the document is missing.
Additionally, CompressorFactory#uncompressIfNeeded now requires a
non-null argument.
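The guard itself is small; a sketch of the shape of the method, with the surrounding field and exception choices assumed rather than copied from the actual class:

```java
// Sketch of GetResult#sourceRef with the new null check: a missing document simply has no
// source, so return null instead of handing a null BytesReference to the compressor.
public BytesReference sourceRef() {
    if (source == null) {
        return null;
    }
    try {
        this.source = CompressorFactory.uncompressIfNeeded(this.source);
        return this.source;
    } catch (IOException e) {
        throw new ElasticsearchParseException("failed to decompress source", e);
    }
}
```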
This patch fixes the incorrect indentation in the REST tests, which makes tests in language runners (e.g. Ruby, Python) fail, since the skip clause is parsed as an empty value. The Java YAML parser is smarter/more lenient about whitespace, so it doesn't catch this.
The recovery process started during primary relocation of shadow replicas accesses the engine on the source shard after it's been closed, which results in the source shard failing itself.
Right now closing a shard looks like it strands refresh listeners,
causing tests like
`delete/50_refresh/refresh=wait_for waits until changes are visible in search`
to fail. Here is a build that fails:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+multi_cluster_search+multijob-darwin-compatibility/4/console
This attempts to fix the problem by implementing `Closeable` on
`RefreshListeners` and rejecting listeners when closed. More importantly,
the act of closing the instance flushes all pending listeners
so we shouldn't have any stranded listeners on close.
Because it was needed for testing, this also adds the number of
pending listeners to the `CommonStats` object and all APIs to which
that flows: `_cat/nodes`, `_cat/indices`, `_cat/shards`, and
`_nodes/stats`.
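A toy, self-contained sketch of the close semantics described above (not the actual `RefreshListeners` class): close flushes whatever is still pending, and anything registered afterwards is completed immediately instead of being stranded.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy model only: the real class also tracks translog locations, forced refreshes, stats, etc.
class RefreshListenersSketch implements Closeable {
    private final List<Consumer<Boolean>> pending = new ArrayList<>();
    private boolean closed = false;

    /** Register a listener; the boolean it receives means "a refresh had to happen first". */
    synchronized void addListener(Consumer<Boolean> listener) {
        if (closed) {
            listener.accept(false);   // rejected: completed right away rather than stranded
            return;
        }
        pending.add(listener);
    }

    /** Called after a refresh: everything that was waiting is now visible. */
    synchronized void afterRefresh() {
        List<Consumer<Boolean>> toFire = new ArrayList<>(pending);
        pending.clear();
        toFire.forEach(listener -> listener.accept(true));
    }

    /** Closing flushes all pending listeners so none are left hanging when the shard closes. */
    @Override
    public synchronized void close() {
        closed = true;
        afterRefresh();
    }

    /** The kind of count the real change surfaces through CommonStats and the _cat/_nodes APIs. */
    synchronized int pendingCount() {
        return pending.size();
    }
}
```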
In pre 2.x versions, if the repository was set to compress snapshots,
then snapshots would be compressed with the LZF algorithm. In 5.x,
Elasticsearch no longer supports the LZF compression algorithm. This
presents an issue when retrieving snapshots in a repository or upgrading
repository data to the 5.x version, because Elasticsearch throws an
exception when it tries to read snapshot metadata that was
compressed using LZF.
This commit gracefully handles the situation by introducing a new
incompatible-snapshots blob to the repository. For any pre-2.x snapshot
that cannot be read, that snapshot is removed from the list of active
snapshots, because the snapshot could not be restored anyway. Instead,
the snapshot is recorded in the incompatible-snapshots blob. When
listing snapshots, both active snapshots and incompatible snapshots will
be listed, with incompatible snapshots showing an `INCOMPATIBLE` state.
Any attempt to restore an incompatible snapshot will result in an
exception.