On some Linux distributions tmpfiles.d cleans files and
directories under /tmp if they haven't been accessed for
10 days.
This can cause problems for ML as ML is currently the only
component that uses the temp directory more than a few
seconds after startup. If you didn't open an ML job for
10 days and then tried to open one then the temp directory
would have been deleted.
This commit prevents the problem occurring in the case of
Elasticsearch being managed by systemd, as systemd private
temp directories are not subject to periodic cleanup (by
default).
Additionally there are now some docs to warn people about
the risk and suggest a manual mitigation for .tar.gz users.
Primary terms were introduced as part of the sequence-number effort (#10708) and added in ES
5.0. Subsequent work introduced the replication tracker which lets the primary own its replication
group (#25692) to coordinate recovery and replication. The replication tracker explicitly exposes
whether it is operating in primary mode or replica mode, independent of the ShardRouting object
that's associated with a shard. During a primary relocation, for example, the primary mode is
transferred between the primary relocation source and the primary relocation target. After
transferring this so-called primary context, the old primary becomes a replication target and the
new primary the replication source, reflected in the replication tracker on both nodes. With the
most recent PR in this area (#32442), we finally have a clean transition between a shard that's
operating as a primary and issuing sequence numbers and a shard that's serving as a replication
target. The transition from one state to the other is enforced through the operation-permit system,
where we block permit acquisition during such changes and perform the transition under this
operation block, ensuring that there are no operations in progress while the transition is being
performed. This finally allows us to turn the best-effort checks that were put in place to prevent
shards from being used in the wrong way (i.e. primary as replica, or replica as primary) into hard
assertions, making it easier to catch any bugs in this area.
Currently, when TranslogCorruptedException is thrown most of the times it does not contain information about the translog location on the file system. There is the translog recovery tool that accepts the translog path as an argument and users are constantly puzzled where to get the path.
This pull request adds "source" information to every TranslogCorruptedException thrown. The source could be local file, remote translog source (used for recovery), assertion (translog entry is constructed to perform some assertion) or translog constructed inside the test.
Closes#24929
This change adds a check so that when parsing the search source, script fields are
ignored when the requested search result size is 0. This helps with e.g. clients like
Kibana that sends a list of script fields that they may need for convenience, but they
don't require any hits. Before this change, user sometimes ran into confusing behaviour,
e.g. the script compilation limit to breaking although no hits were requested.
Closes#31824
* We were comparing the wrong timeout value in the `randomValueOtherThan` call here, leading to no mutation happening for a certain seed
* closes#32639
Today content type detection on an input stream works by peeking up to
twenty bytes into the stream. If the stream is headed by more whitespace
than twenty bytes, we might fail to detect the content type. We should
be ignoring this whitespace before attempting to detect the content
type. This commit does that by ignoring all leading whitespace in an
input stream before attempting to guess the content type.
Currently, snippets in lists cannot be rendered correctly as a console command because the console command requires a line continuation '+'. This allows snippets to have a line continuation between the snippet and the // CONSOLE.
* INGEST: Fix ThreadWatchDog Throwing on Shutdown
* #32539 is caused by the fact that ThreadWatchDog.Default could throw on shutdown if the ThreadPool is interrupted while `interruptLongRunningExecutions` is in progress. This is a result of the watchdog not having a lifecycle of its own (normally it terminates when the threadpool terminates).
* We can't easily use `org.elasticsearch.common.util.concurrent.EsRejectedExecutionException#isExecutorShutdown` to catch this state the same way other components do since thatwould require adding the core lib to Grok as a dependency
* Since we have no knowledge of the lifecycle in this compontent since we're only passed the scheduler `BiFunction` I fixed this by only scheduling the watchdog when there's actually registered threads in it.
* I think using the patter of locking via two `Atomic*` values should not be much of a performance concern here under load since either the integer will likely be > 0 in this case (because we have multiple Grok in parallel) or the running state will be true because there likely was at least one thread registered when the watchdog ran and so the enqueing of the watchdog task during `register` will happen very rarely here (in the worst case scenario of only a single Grok thread it will happen less frequently than once every `ingest.grok.watchdog.interval`). The atomic update on the count should not be relevant relative to the cost of adding a new node to the CHM either.
* Fixes#32539
* Also fixes the watchdog to run if it doens't have to in general.
The assertion in the test was not broad enough. If the timing is very unlucky, the
shard is already promoted to primary before the indexOnReplica even gets to execute.
Closes#32645
* master:
Cross-cluster search: preserve cluster alias in shard failures (#32608)
Handle AlreadyClosedException when bumping primary term
[TEST] Allow to run in FIPS JVM (#32607)
[Test] Add ckb to the list of unsupported languages (#32611)
SCRIPTING: Move Aggregation Scripts to their own context (#32068)
Painless: Use LocalMethod Map For Lookup at Runtime (#32599)
[TEST] Enhance failure message when bulk updates have failures
[ML] Add ML result classes to protocol library (#32587)
Suppress LicensingDocumentationIT.testPutLicense in release builds (#32613)
[Rollup] Update wire version check after backport
Suppress Wildfly test in FIPS JVMs (#32543)
[Rollup] Improve ID scheme for rollup documents (#32558)
ingest: doc: move Dot Expander Processor doc to correct position (#31743)
[ML] Add some ML config classes to protocol library (#32502)
[TEST]Split transport verification mode none tests (#32488)
Core: Move helper date formatters over to java time (#32504)
[Rollup] Remove builders from DateHistogramGroupConfig (#32555)
[TEST} unmutes SearchAsyncActionTests and adds debugging info
[ML] Add Detector config classes to protocol library (#32495)
[Rollup] Remove builders from MetricConfig (#32536)
Tests: Add rolling upgrade tests for watcher (#32428)
Fix race between replica reset and primary promotion (#32442)
Rest HL client: Add get license action
Continues to use String instead of a more complex License class to
hold the license text similarly to put license.
Relates #29827
The Apache Http components support for Spnego scheme
uses canonical name by default.
Also when resolving host name, on centos by default
there are other aliases so adding them to the
DelegationPermission.
Closes#32498
If a leader index is deleted while there is an active follower, the
follower will send shard changes requests bound for the leader
index. Today this will result in a null pointer exception because there
will not be an index routing table for the index. A null pointer
exception looks like a bug to a user so this commit addresses this by
throwing an index not found exception instead.
When some remote clusters return shard failures as part of a cross-cluster search request, the cluster alias currently gets lost. As a result, if the shard failures are all caused by the same error, and against indices belonging to different clusters, but with the same index name, only one failure gets returned as part of the search response, meaning that failures are grouped by index name, ignoring the cluster alias.
With this commit we make sure that `ShardSearchFailure` returns the cluster alias as part of the index name. Also, we set the fully qualfied index name when creating a `QueryShardException`. That way shard failures are grouped by cluster:index. Such fixes should cover at least most of the cases where either 1) the shard target is set but we don't have the index in the cause (we were previously reading it only from the cause that did not have the cluster alias) 2) the shard target is missing but if the cause is a `QueryShardException` the cluster alias does not get lost.
We also prevent NPE in case the failure cause is not set and test such scenario.
If the shard is already closed while bumping the primary term, this can result in an
AlreadyClosedException to be thrown. As we use asyncBlockOperations, the exception
will be thrown on a thread from the generic thread pool and end up in the uncaught
exception handler, failing our tests.
Relates to #32442
* Change SecurityNioHttpServerTransportTests to use PEM key and
certificate files instead of a JKS keystore so that this tests
can also run in a FIPS 140 JVM
* Do not attempt to run cases with ssl.verification_mode NONE in
SessionFactoryTests so that the tests can run in a FIPS 140 JVM
This commit addresses a race that can happen in the basic CCR stats REST
tests. Namely, peek reads can fire before the REST test client fires the
stats request. This means that we have to weaken our assertions about
the expected stats response.
This modifies Def to use a Map<String, LocalMethod> to look up user-defined methods at runtime
instead of writing constant methodhandles to do the reverse lookup. This creates a consistency
between how LocalMethods are looked up at compile-time and run-time. This consistency will allow
this code to be more maintainable moving forward. This will also allow FunctionReference to be
cleaned up in a follow up PR.
This commit adds the ML results classes to the X-Pack protocol
library used by the high level REST client.
(Other commits will add the config classes and stats classes.)
These classes:
- Are publically immutable
- Are privately mutable - this is perhaps not as nice as the
config classes, but to do otherwise would require adding
builders and the corresponding server-side classes that the
old transport client used don't have builders
- Have little/no validation of field values beyond null checks
- Are convertible to and from X-Content, but NOT wire transportable
- Have lenient parsers to maximize compatibility across versions
- Have the same class names and getter names as the corresponding
classes in X-Pack core to ease migration for transport client
users
- Don't reproduce all the methods that do calculations or
transformations that the the corresponding classes in X-Pack core
have
The testPutLicense test tries to put a license generated using
snapshot keys into release cluster. This commit suppresses the
test during the release builds.
Closes#32580
Bumping down the version to 6.4 since the backport is complete. Also
adds some missing version checks to the bwc tests to make sure it
only runs on the correct versions
Previously, we were using a simple CRC32 for the IDs of rollup documents.
This is a very poor choice however, since 32bit IDs leads to collisions
between documents very quickly.
This commit moves Rollups over to a 128bit ID. The ID is a concatenation
of all the keys in the document (similar to the rolling CRC before),
hashed with 128bit Murmur3, then base64 encoded. Finally, the job
ID and a delimiter (`$`) are prepended to the ID.
This gurantees that there are 128bits per-job. 128bits should
essentially remove all chances of collisions, and the prepended
job ID means that _if_ there is a collision, it stays "within"
the job.
BWC notes:
We can only upgrade the ID scheme after we know there has been a good
checkpoint during indexing. We don't rely on a STARTED/STOPPED
status since we can't guarantee that resulted from a real checkpoint,
or other state. So we only upgrade the ID after we have reached
a checkpoint state during an active index run, and only after the
checkpoint has been confirmed.
Once a job has been upgraded and checkpointed, the version increments
and the new ID is used in the future. All new jobs use the
new ID from the start
This commit adds four ML config classes to the X-Pack protocol
library used by the high level REST client.
(Other commits will add the remaining config classes, plus results
and stats classes.)
These classes:
- Are immutable
- Have little/no validation of field values beyond null checks
- Are convertible to and from X-Content, but NOT wire transportable
- Have lenient parsers to maximize compatibility across versions
- Have the same class names, member names and getter/setter names
as the corresponding classes in X-Pack core to ease migration
for transport client users
- Don't reproduce all the methods that do calculations or
transformations that the the corresponding classes in X-Pack core
have
This commit splits SecurityNetty4TransportTests in two methods
one handling verification mode certificate and full and one
handling verification mode none. This is done so that the second
method can be muted in a FIPS 140 JVM where verification mode none
cannot be used.
Some classes use internal date formatters, which now can be moved over
to java time using the DateFormatters class.
The same applies for a few test cases.
Same motivation as #32507 but for the DateHistogramGroupConfig
configuration object. This pull request also changes the format of the
time zone from a Joda's DateTimeZone to a simple String.
It should help to port the API to the high level rest client and allows
clients to not be forced to use the Joda Time library. Serialization is
impacted but does not need a backward compatibility layer as
DateTimeZone are serialized as String anyway. XContent also expects
a String for timezone, so I found it easier to move everything to String.
Related to #29827
This unmutes the testFanOutAndCollect()` method and add a check to make
sure we aren't accidentally running something twice causing a search
phase to still be running after we have counted down the latch
Relates to #29242
This commit adds the Detector class and its dependencies to the
X-Pack protocol library used by the high level REST client.
(Future commits will add the remaining config classes, plus results
and stats classes.)
These classes:
- Are immutable, with builders, but the builders do no validation
beyond null checks
- Are convertible to and from X-Content, but NOT wire transportable
- Have lenient parsers to maximize compatibility across versions
- Have the same class names, member names and getter/setter names
as the corresponding classes in X-Pack core to ease migration
for transport client users
- Don't reproduce all the methods that do calculations or
transformations that the the corresponding classes in X-Pack core
have
These tests ensure, that the basic watch APIs are tested in the rolling
upgrade tests. After initially adding a watch, the tests try to get,
execute, deactivate and activate a watch. Watcher stats are tested as
well, and an own java based test has been added for restarting, as that
requires waiting for a state change. Watcher history is also checked.
Closes#31216
We've recently seen a number of test failures that tripped an assertion in IndexShard (see issues
linked below), leading to the discovery of a race between resetting a replica when it learns about a
higher term and when the same replica is promoted to primary. This commit fixes the race by
distinguishing between a cluster state primary term (called pendingPrimaryTerm) and a shard-level
operation term. The former is set during the cluster state update or when a replica learns about a
new primary. The latter is only incremented under the operation block, which can happen in a
delayed fashion. It also solves the issue where a replica that's still adjusting to the new term
receives a cluster state update that promotes it to primary, which can happen in the situation of
multiple nodes being shut down in short succession. In that case, the cluster state update thread
would call `asyncBlockOperations` in `updateShardState`, which in turn would throw an exception
as blocking permits is not allowed while an ongoing block is in place, subsequently failing the shard.
This commit therefore extends the IndexShardOperationPermits to allow it to queue multiple blocks
(which will all take precedence over operations acquiring permits). Finally, it also moves the primary
activation of the replication tracker under the operation block, so that the actual transition to
primary only happens under the operation block.
Relates to #32431, #32304 and #32118
* master:
HLRC: Move commercial clients from XPackClient (#32596)
Add cluster UUID to Cluster Stats API response (#32206)
Security: move User to protocol project (#32367)
[TEST] Test for shard failures, add debug to testProfileMatchesRegular
Minor fix for javadoc (applicable for java 11). (#32573)
Painless: Move Some Lookup Logic to PainlessLookup (#32565)
TEST: Avoid merges in testSeqNoAndCheckpoints
[Rollup] Remove builders from HistoGroupConfig (#32533)
Mutes failing SQL string function tests due to #32589
fixed elements in array of produced terms (#32519)
INGEST: Enable default pipelines (#32286)
Remove cluster state initial customs (#32501)
Mutes LicensingDocumentationIT due to #32580
[ML] Remove multiple_bucket_spans (#32496)
[ML] Rename JobProvider to JobResultsProvider (#32551)
Correct minor typo in explain.asciidoc for HLRC
Build: Add elastic maven to repos used by BuildPlugin (#32549)
Clarify the error message when a pipeline agg is used in the 'order' parameter. (#32522)
Revert "[test] turn on host io cache for opensuse (#32053)"
Enable packaging tests on suse boxes
[ML] Improve error when no available field exists for rule scope (#32550)
[ML] Improve error for functions with limited rule condition support (#32548)
Painless: Clean Up PainlessField (#32525)
Add @AwaitsFix for #32554
Remove broken @link in Javadoc
Scripting: Conditionally use java time api in scripting (#31441)
[ML] Fix thread leak when waiting for job flush (#32196) (#32541)
Add AwaitsFix to failing test - see #32546
Core: Minor size reduction for AbstractComponent (#32509)
SQL: Added support for string manipulating functions with more than one parameter (#32356)
[DOCS] Reloadable Secure Settings (#31713)
Watcher: Reenable HttpSecretsIntegrationTests#testWebhookAction test (#32456)
[Rollup] Remove builders from TermsGroupConfig (#32507)
Use hostname instead of IP with SPNEGO test (#32514)
Switch x-pack rolling restart to new style Requests (#32339)
NETWORKING: Fix Netty Leaks by upgrading to 4.1.28 (#32511)
[DOCS] Small fixes in rule configuration page (#32516)
Painless: Clean up PainlessMethod (#32476)
Build: Remove shadowing from benchmarks (#32475)
Docs: Add all JDKs to CONTRIBUTING.md
Add licensing enforcement for FIPS mode (#32437)
SQL: Add test for handling of partial results (#32474)
Mute testFilterCacheStats
[ML][DOCS] Fix typo applied_to => applies_to
Scripting: Fix painless compiler loader to know about context classes (#32385)