Today we have a few problems with how we handle bad requests:
- handling requests with bad encoding
- handling requests with invalid value for filter_path/pretty/human
- handling requests with a garbage Content-Type header
There are two problems:
- in every case, we give an empty response to the client
- in most cases, we leak the byte buffer backing the request!
These problems are caused by a broader problem: poor handling preparing
the request for handling, or the channel to write to when the response
is ready. This commit addresses these issues by taking a unified
approach to all of them that ensures that:
- we respond to the client with the exception that blew us up
- we do not leak the byte buffer backing the request
We historically removed reading from the transaction log to get consistent
results from _GET calls. There was also the motivation that the read-modify-update
principle we apply should not be hidden from the user. We still agree on the fact
that we should not hide these aspects but the impact on updates is quite significant
especially if the same documents is updated before it's written to disk and made serachable.
This change adds back the ability to read from the transaction log but only for update calls.
Calls to the _GET API will always do a refresh if necessary to return consistent results ie.
if stored fields or DocValues Fields are requested.
Closes#26802
When the `BulkProcessor` is used with the high-level REST client, a scheduler is internally created that allows to schedule tasks. Such scheduler is not exposed to users and needs to be closed once the `BulkProcessor` is closed. There are two ways to close the `BulkProcessor` though, one is the ordinary `close` method and the other one is `awaitClose`. The former closes the scheduler while the latter doesn't, leaving threads lingering.
We previously had a property to specify the location of the REST test
spec files but this was removed in a previous refactoring yet left
behind in the docs. This commit removes the last remaining vestige of
this parameter.
DocumentParser: The checks for Text and Keyword were masked by the
earlier check for String, which they are child classes of. As String
field types are no longer supported, this check can be removed.
Restoring a snapshot, or getting the status of finished
snapshots, currently always load the global state metadata
file from the repository even if it not required. This
slows down the restore process (or listing statuses process)
and can also be an issue if the global state cannot be
deserialized (because it has unknown customs for example).
This commit splits the Repository.getSnapshotMetadata()
method into two distincts methods: getGlobalMetadata()
and getIndexMetadata() that are now called only when needed.
When a module or plugin register that it has a client JAR, we copy
artifacts like the Javadoc and sources JARs as the JARs for the client
as well (with -client added to the name). I previously had to disable
the Javadoc task on JDK 10 due to a bug in bin/javadoc. After JDK 10
went GA without a fix for this bug, I added workaround to fix the
Javadoc task on JDK 10. However, I made a mistake reverting the
previously skipped Javadocs tasks and missed that one that copies the
Javadoc JAR for client JARs. This commit fixes that issue.
* Decouple NamedXContentRegistry from ElasticsearchException
This commit decouples `NamedXContentRegistry` from using either
`ElasticsearchException`, `ParsingException`, or `UnknownNamedObjectException`.
This will allow us to move NamedXContentRegistry to its own lib as part of the
xcontent extraction work.
Relates to #28504
HttpInfo is passed the maxContentLength as a parameter, but this value should
never be negative. This fixes the test to only pass a positive random value.
* Remove all dependencies from XContentBuilder
This commit removes all of the non-JDK dependencies from XContentBuilder, with
the exception of `CollectionUtils.ensureNoSelfReferences`. It adds a third
extension point around dealing with time-based fields and formatters to work
around the Joda dependency.
This decoupling allows us to be able to move XContentBuilder to a separate lib
so it can be available for things like the high level rest client.
Relates to #28504
In 5.2 `ignore_unmapped` was added to `inner_hits` in order to ignore invalid mapping.
This value was automatically set to the value defined in the parent query (`nested`, `has_child`, `has_parent`) but the refactoring of the parent/child in 5.6 removed this behavior unintentionally.
This commit restores this behavior but also makes sure that we always automatically enforce this value when the query builder is used directly (previously this was only done by the XContent deserialization).
Closes#29071
The default timeout (eg. 10 seconds) may not be enough for CI to
re-allocate shards after the partion is healed. This commit increases
the timeout to 30 seconds and enables logging in order to have more
detailed information in case this test failed again.
Closes#29060
This was the plan from day one but due to a silly bug nodes were immediately retried after they were marked as dead for the first time. From the second time on, the expected backoff was applied.
In #testPruneOnlyDeletesAtMostLocalCheckpoint, we create a new engine
but mistakenly use the same translog directory of the existing engine.
This prevents translog files from cleaning up when closing the engines.
ERROR 0.12s J2 | InternalEngineTests.testPruneOnlyDeletesAtMostLocalCheckpoint <<< FAILURES!
> Throwable #1: java.io.IOException: could not remove the following files (in the order of attempts):
> translog-primary-060/translog-2.tlog: java.io.IOException: access denied:
This commit makes sure to use a separate directory for each engine in
this tes.
When processing an append-only operation, primary knows that operations
can only conflict with another instance of the same operation. This is
true as the id was freshly generated. However this property doesn't hold
for replicas. As soon as an auto-generated ID was indexed into the
primary, it can be exposed to a search and users can issue a follow up
operation on it. In extremely rare cases, the follow up operation can be
arrived and processed on a replica before the original append-only
request. In this case we can't simply proceed with the append-only
request and blindly add it to the index without consulting the version
map.
The following scenario can cause difference between primary and
replica.
1. Primary indexes an auto-gen-id doc. (id=X, v=1, s#=20)
2. A refresh cycle happens on primary
3. The new doc is picked up and modified - say by a delete by query
request - Primary gets a delete doc (id=X, v=2, s#=30)
4. Delete doc is processed first on the replica (id=X, v=2, s#=30)
5. Indexing operation arrives on the replica, since it's an auto-gen-id
request and the retry marker is lower, we put it into lucene without
any check. Replica has a doc the primary doesn't have.
To deal with a potential conflict between an append-only operation and a
normal operation on replicas, we need to rely on sequence numbers. This
commit maintains the max seqno of non-append-only operations on replica
then only apply optimization for an append-only operation only if its
seq# is higher than the seq# of all non-append-only.
The vagrant test plugin adds tasks for the groovy packaging tests,
which run after the bats packaging test tasks.Rename the 'bats'
configuration to 'packaging' and remove the option to inherit
archives from this configuration.
Once a document is deleted and Lucene is refreshed, we will not be able
to look up the `version/seq#` associated with that delete in Lucene. As
conflicting operations can still be indexed, we need another mechanism
to remember these deletes. Therefore deletes should still be stored in
the Version Map, even after Lucene is refreshed. Obviously, we can't
remember all deletes forever so a trimming mechanism is needed.
Currently, we remember deletes for at least 1 minute (the default GC
deletes cycle) and clean them periodically. This is, at the moment, the
best we can do on the primary for user facing APIs but this arbitrary
time limit is problematic for replicas. Furthermore, we can't rely on
the primary and replicas doing the trimming in a synchronized manner,
and failing to do so results in the replica and primary making different
decisions.
The following scenario can cause inconsistency between
primary and replica.
1. Primary index doc (index, id=1, v2)
2. Network packet issue causes index operation to back off and wait
3. Primary deletes doc (delete, id=1, v3)
4. Replica processes delete (delete, id=1, v3)
5. 1+ minute passes (GC deletes runs replica)
6. Indexing op is finally sent to the replica which no processes it
because it forgot about the delete.
We can reply on sequence-numbers to prevent this issue. If we prune only
deletes whose seqno at most the local checkpoint, a replica will
correctly remember what it needs. The correctness is explained as
follows:
Suppose o1 and o2 are two operations on the same document with seq#(o1)
< seq#(o2), and o2 arrives before o1 on the replica. o2 is processed
normally since it arrives first; when o1 arrives it should be discarded:
1. If seq#(o1) <= LCP, then it will be not be added to Lucene, as it was
already previously added.
2. If seq#(o1) > LCP, then it depends on the nature of o2:
- If o2 is a delete then its seq# is recorded in the VersionMap,
since seq#(o2) > seq#(o1) > LCP, so a lookup can find it and
determine that o1 is stale.
- If o2 is an indexing then its seq# is either in Lucene (if
refreshed) or the VersionMap (if not refreshed yet), so a
real-time lookup can find it and determine that o1 is stale.
In this PR, we prefer to deploy a single trimming strategy, which
satisfies both requirements, on primary and replicas because:
- It's simpler - no need to distinguish if an engine is running at
primary mode or replica mode or being promoted.
- If a replica subsequently is promoted, user experience is fully
maintained as that replica remembers deletes for the last GC cycle.
However, the version map may consume less memory if we deploy two
different trimming strategies for primary and replicas.
testUnassignedShardAndEmptyNodesInRoutingTable and that test is as old as time and does a very bogus thing.
it is an IT test which extracts the GatewayAllocator from the node and tells it to allocated unassigned
shards, while giving it a conjured cluster state with no nodes in it (it uses the DiscoveryNodes.EMPTY_NODES.
This is never a cluster state we want to reroute on (we always have at least master node in it).
I'm going to just delete the test as I don't think it adds much value.
Closes#21463
#28245 has introduced the utility class`EngineDiskUtils` with a set of methods to prepare/change
translog and lucene commit points. That util class bundled everything that's needed to create and
empty shard, bootstrap a shard from a lucene index that was just restored etc.
In order to safely do these manipulations, the util methods acquired the IndexWriter's lock. That
would sometime fail due to concurrent shard store fetching or other short activities that require the
files not to be changed while they read from them.
Since there is no way to wait on the index writer lock, the `Store` class has other locks to make
sure that once we try to acquire the IW lock, it will succeed. To side step this waiting problem, this
PR folds `EngineDiskUtils` into `Store`. Sadly this comes with a price - the store class doesn't and
shouldn't know about the translog. As such the logic is slightly less tight and callers have to do the
translog manipulations on their own.
Some source files seem to have the execute bit (a+x) set, which doesn't
really seem to hurt but is a bit odd. This change removes those, making
the permissions similar to other source files in the repository.
This change refactors the composite aggregation to add an execution mode that visits documents in the order of the values
present in the leading source of the composite definition. This mode does not need to visit all documents since it can early terminate
the collection when the leading source value is greater than the lowest value in the queue.
Instead of collecting the documents in the order of their doc_id, this mode uses the inverted lists (or the bkd tree for numerics) to collect documents
in the order of the values present in the leading source.
For instance the following aggregation:
```
"composite" : {
"sources" : [
{ "value1": { "terms" : { "field": "timestamp", "order": "asc" } } }
],
"size": 10
}
```
... can use the field `timestamp` to collect the documents with the 10 lowest values for the field instead of visiting all documents.
For composite aggregation with more than one source the execution can early terminate as soon as one of the 10 lowest values produces enough
composite buckets. For instance if visiting the first two lowest timestamp created 10 composite buckets we can early terminate the collection since it
is guaranteed that the third lowest timestamp cannot create a composite key that compares lower than the one already visited.
This mode can execute iff:
* The leading source in the composite definition uses an indexed field of type `date` (works also with `date_histogram` source), `integer`, `long` or `keyword`.
* The query is a match_all query or a range query over the field that is used as the leading source in the composite definition.
* The sort order of the leading source is the natural order (ascending since postings and numerics are sorted in ascending order only).
If these conditions are not met this aggregation visits each document like any other agg.
The rank_eval documentation was missing an explanation of the parameter
`k` that controls the number of top hits that are used in the ranking evaluation.
Closes#29205
Adds docs for `HighLevelRestClient#multiSearch`. Unlike the `multiGet`
docs these are much more sparse because multi-search doesn't support
setting many options on the `MultiSearchRequest` and instead just wraps
a list of `SearchRequest`s.
Closes#28389
This enhancement adds Z value support (source only) to geo_shape fields. If vertices are provided with a third dimension, the third dimension is ignored for indexing but returned as part of source. Like beofre, any values greater than the 3rd dimension are ignored.
closes#23747
This commit removes type-casts in logging in the server component (other
components will be done later). This also adds a parameterized message
test which would catch breaking-changes related to lambdas in Log4J.
While working on #27799, we find that it might make sense to change BroadcastResponse from ToXContentFragment to ToXContentObject, seeing that it's rather a complete XContent object and also the other Responses are normally ToXContentObject.
By doing this, we can also move the XContent build logic of BroadcastResponse's subclasses, from Rest Layer to the concrete classes themselves.
Relates to #3889
This commit adds a note to the low-level REST client docs regarding the
possibility of being impacted by the JVM DNS cache policy under a
default security manager policy.
In #28350, we fixed an endless flushing loop which may happen on
replicas by tightening the relation between the flush action and the
periodically flush condition.
1. The periodically flush condition is enabled only if it is disabled
after a flush.
2. If the periodically flush condition is enabled then a flush will
actually happen regardless of Lucene state.
(1) and (2) guarantee that a flushing loop will be terminated. Sadly,
the condition 1 can be violated in edge cases as we used two different
algorithms to evaluate the current and future uncommitted translog size.
- We use method `uncommittedSizeInBytes` to calculate current
uncommitted size. It is the sum of translogs whose generation at least
the minGen (determined by a given seqno). We pick a continuous range of
translogs since the minGen to evaluate the current uncommitted size.
- We use method `sizeOfGensAboveSeqNoInBytes` to calculate the future
uncommitted size. It is the sum of translogs whose maxSeqNo at least
the given seqNo. Here we don't pick a range but select translog one by
one.
Suppose we have 3 translogs `gen1={#1,#2}, gen2={}, gen3={#3} and
seqno=#1`, `uncommittedSizeInBytes` is the sum of gen1, gen2, and gen3
while `sizeOfGensAboveSeqNoInBytes` is the sum of gen1 and gen3. Gen2 is
excluded because its maxSeqno is still -1.
This commit removes both `sizeOfGensAboveSeqNoInBytes` and
`uncommittedSizeInBytes` methods, then enforces an engine to use only
`sizeInBytesByMinGen` method to evaluate the periodically flush condition.
Closes#29097
Relates ##28350
This commit removes some parameters deprecated in 6.x (or 5.x):
`use_dismax`, `split_on_whitespace`, `all_fields` and `lowercase_expanded_terms`.
Closes#25551
Remove license information from README.textile in preparation for the
the inclusion of X-Pack into the main elasticsearch repository.
This is necessary to avoid confusion around licensing, as the entire
repository will no longer be Apache 2.0.
This commit decouples `BytesRef`, `Releaseable`, and `TimeValue` from
XContentBuilder, and paves the way for doupling `ByteSizeValue` as well. It
moves much of the Lucene and Joda encoding into a new SPI extension that is
loaded by XContentBuilder to know how to encode these values.
Part of doing this also allows us to make JSON encoding strict, as we no longer
allow just any old object to be passed (in the past it was possible to get json
that was `"field": "java.lang.Object@d8355a8"` if no one was careful about what
was passed in).
Relates to #28504