Currently test that check that equals() and hashCode() are working as expected
for classes implementing them are quiet similar. This change moves common
assertions in this method to a common utility class. In addition, another common
utility function in most of these test classes that creates copies of input
object by running them through a StreamOutput and reading them back in, is moved
to ESTestCase so it can be shared across all these classes.
Closes#20629
Today the request interceptor can't support async calls since the response
of the async call would execute on a different thread ie. a client or listener
thread. This means in-turn that the intercepted handler is not executed with the
thread it was supposed to run and therefor can, if it's executing blocking
operations, potentially deadlock an entire server.
* Move all zen discovery classes into o.e.discovery.zen
This collapses sub packages of zen into zen. These all had just a couple
classes each, and there is really no reason to have the subpackages.
* fix checkstyle
`LocalDiscovery` is a discovery implementation that uses static in memory maps to keep track of current live nodes. This is used extensively in our tests in order to speed up cluster formation (i.e., shortcut the 3 second ping period used by `ZenDiscovery` by default). This is sad as that mean that most of the test run using a different discovery semantics than what is used in production. Instead of replacing the entire discovery logic, we can use a similar approach to only shortcut the pinging components.
During a recent merge from master, we lost the bridge from IndicesClusterStateService to the GlobalCheckpointService of primary shards, notifying them of changes to the current set of active/initializing shards. This commits add the bridge back (with unit tests). It also simplifies the GlobalCheckpoint tracking to use a simpler model (which makes use the fact that the global check point sync is done periodically).
The old integration CheckpointIT test is moved to IndexLevelReplicationTests. I also added similar assertions to RelocationsIT, which surfaced a bug in the primary relocation logic and how it plays with global checkpoint updates. The test is currently await-fixed and will be fixed in a follow up issue.
This commit changes the current REST API parser to make it fail and throw an exception when a REST specification file contains a duplicated parameters, or path, or method, or path part.
This is important to allow any test to use RandomQueryBuilder#createQuery()
since some of the query builders that are used in this test test the length
of the types array and otherwise will thow NPE if the test is not a subclass
of AbstractQueryTestCase.
The cache relies on the equals() method so we just need to make sure script
queries can never be equals, even to themselves in the case that a weight
is used to produce a Scorer on the same segment multiple times.
Closes#20763
Today we throw an assertion error if we release an AbstractArray more than once.
Yet, it's recommended to implement close methods such that they can be invoked
more than once. Guaranteed single release calls are hard to implement and some
situations might not be tested causing for instance `CircuitBreaker` to operate on
corrupted memory stats.
UpdateHelper, MetaDataIndexUpgradeService, and some recovery
stuff.
Move ClusterSettings to nullable ctor parameter of TransportService
so it isn't forgotten.
This change proposes the removal of all non-tcp transport implementations. The
mock transport can be used by default to run tests instead of local transport that has
roughly the same performance compared to TCP or at least not noticeably slower.
This is a master only change, deprecation notice in 5.x will be committed as a
separate change.
Today SearchContext expose the current context as a thread local which makes any kind of sane interface design very very hard. This PR removes the thread local entirely and instead passes the relevant context anywhere needed. This simplifies state management dramatically and will allow for a much leaner SearchContext interface down the road.
LongGCDisruption suspends and resumes node threads but respects several
`unsafe` class name patterns where it's unsafe to suspend. For instance
log4j uses a global lock so we can't suspend a thread that is currently
calling into log4j. The same is true for the security manager, it's similar
to log4j a shared resource between the test and the node that is _suspended_.
This change adds `java.lang.SecrityManager` to the unsafe patterns.
This prevents test framework deadlocking if a nodes thread is supended
while it's calling into the security manager that uses synchronized maps etc.
today it's not possible to use date-math efficiently with the `_rollover`
API. This change adds support for date-math in the target index as well as
support for preserving the math logic when an existing index that was created with
a date math expression all subsequent indices are created with the same expression.
The logging listener tests started failing after
953a8a959b when the tests are run with
tests.es.logger.level set to any level other than debug. This is because
these tests were based around the assumption that the default logging
level was info, which was the case before that commit fixed setting the
default logging level via that system property. This commit fixes these
failing tests by adjusting this assumption to account for the fact that
the default logging level could be different.
Pipe in the `tests.es.logger.level` system property to the log4j config file used in tests. We still default to info. Also adapts the logger name to use the first letter of packages.
Pipe in the `tests.es.logger.level` system property to the log4j config file used in tests. We still default to info. Also adapts the logger name to use the first letter of packages.
* master: (1199 commits)
[DOCS] Remove non-valid link to mapping migration document
Revert "Default `include_in_all` for numeric-like types to false"
test: add a test with ipv6 address
docs: clearify that both ip4 and ip6 addresses are supported
Include complex settings in settings requests
Add production warning for pre-release builds
Clean up confusing error message on unhandled endpoint
[TEST] Increase logging level in testDelayShards()
change health from string to enum (#20661)
Provide error message when plugin id is missing
Document that sliced scroll works for reindex
Make reindex-from-remote ignore unknown fields
Remove NoopGatewayAllocator in favor of a more realistic mock (#20637)
Remove Marvel character reference from guide
Fix documentation for setting Java I/O temp dir
Update client benchmarks to log4j2
Changes the API of GatewayAllocator#applyStartedShards and (#20642)
Removes FailedRerouteAllocation and StartedRerouteAllocation
IndexRoutingTable.initializeEmpty shouldn't override supplied primary RecoverySource (#20638)
Smoke tester: Adjust to latest changes (#20611)
...
Many of our unit tests instantiate an `AllocationService`, which requires having a `GatewayAllocator`. Today almost all of our test use a class called `NoopGatewayAllocator` which does nothing, effectively leaving all shard assignments to the balanced allocator. This is sad as it means we test a system that behaves differently than our production logic in very basic things. For example, a started primary that is lost will be assigned to a node that didn't use to have it.
This PR removes `NoopGatewayAllocator` in favor of a new `TestGatewayAllocator` that inherits the standard `GatewayAllocator` and overrides shard information fetching to return information based on historical assignments the allocator has done. The only exception is `BalanceConfigurationTests` which does test only the balancer and I opted to not have it work around the `GatewayAllocator` being in it's way.
Changes the API of GatewayAllocator#applyStartedShards and
GatewayAllocator#applyFailedShards to take both a RoutingAllocation
and a list of shards to apply. This allows better mock allocators
to be created as being done in #20637.
Closes#20642
Removes the FailedRerouteAllocation class and StartedRerouteAllocation
class, as they were just wrappers for RerouteAllocation that stored
started and failed shards, but these started and failed shards can
be passed in directly to the methods that needed them, removing the
need for this wrapper class and extra level of indirection.
Closes#20626
Today we hold on to all possible tokenizers, tokenfilters etc. when we create
an index service on a node. This was mainly done to allow the `_analyze` API to
directly access all these primitive. We fixed this in #19827 and can now get rid of
the AnalysisService entirely and replace it with a simple map like class. This
ensures we don't create a gazillion long living objects that are entirely useless since
they are never used in most of the indices. Also those objects might consume a considerable
amount of memory since they might load stopwords or synonyms etc.
Closes#19828
With the switch to Log4j 2 throughout our code base, the logger usage checker was temporarily disabled. This commit
adapts the checks to work with Log4j 2 and re-enables the Gradle checks.
Closes#20243
This commit changes the default behavior of `_flush` to block if other flushes are ongoing.
This also removes the use of `FlushNotAllowedException` and instead simply return immediately
by skipping the flush. Users should be aware if they set this option that the flush might or might
not flush everything to disk ie. no transactional behavior of some sort.
Closes#20569
This commit removes `ByteSizeValue`'s methods that are duplicated (ex: `mbFrac()` and `getMbFrac()`) in order to only keep the `getN` form.
It also renames `mb()` -> `getMb()`, `kb()` -> `getKB()` in order to be more coherent with the `ByteSizeUnit` method names.
This PR introduces backward compatibility index tests to test the rolling upgrade process amongst Elasticsearch instances within the same major version. The test executes in three phases. In the first phase, we form a cluster of 2 ES instances on an old version. In the second phase, we keep one of the nodes from the old cluster, kill the other node, but preserve its data directory and start an instance of the current version of ES using the same data directory as the killed instance. In the third phase, we kill the other old version ES instance from the first phase and launch a new instance, using the same data directory as the killed instance. Therefore, during phase 3, we have fully migrated and have all current versions of ES running. In each phase, we run REST tests that index documents and search them, ensuring at each stage that the documents from the previous phase are still there.
Note that because we haven't released a GA yet of 5.0, the tests currently don't start an old version cluster in the first phase. Once GA is released, this will be changed to make the backward compatibility version 5.0, while the current version in the cluster will be 5.x.
This change removes all guice interaction from Transport, HttpServerTransport,
HttpServer and TransportService. All these classes as well as their subclasses
or extended version configured via plugins are now created by using plain old
bloody java constructors. YAY!
Currently all the reroute-like methods of `AllocationService` return a result object of type `RoutingAllocation.Result`. The result object contains the new `RoutingTable` and `MetaData` plus an indication whether those were changed. The caller is then responsible of updating a cluster state with these. These means that things can easily go wrong and one can take one of these but not the other causing inconsistencies. We already have a utility method on the `ClusterState` builder that does but no one forces you to do so. Also 99% of the callers do the same thing: i.e., check if the result was changed and if so update the very same cluster state that was passed to `AllocationService`. This PR folds this pattern into `AllocationService` and changes almost all it's methods to return a new cluster state (potentially the original one). This saves some 500 lines of code.
The one exception here is the reroute API which executes allocation commands and potentially returns an explanation as well (next to the routing table and metadata). That API now returns a `CommandsResult` object which encapsulate a cluster state and the explanation.
TransportService is such a central part of the core server, replacing
it's implementation is risky and can cause serious issues. This change removes the ability to
plug in TransportService but allows registering a TransportInterceptor that enables
plugins to intercept requests on both the sender and the receiver ends. This is a commonly used
and overwritten functionality but encapsulates the custom code in a contained manner.
During a networking partition, cluster states updates (like mapping changes or shard assignments)
are committed if a majority of the masters node received the update correctly. This means that the current master has access to enough nodes in the cluster to continue to operate correctly. When the network partition heals, the isolated nodes catch up with the current state and get the changes they couldn't receive before. However, if a second partition happens while the cluster
is still recovering from the previous one *and* the old master is put in the minority side, it may be that a new master is elected which did not yet catch up. If that happens, cluster state updates can be lost.
This commit fixed 95% of this rare problem by adding the current cluster state version to `PingResponse` and use them when deciding which master to join (and thus casting the node's vote).
Note: this doesn't fully mitigate the problem as a cluster state update which is issued concurrently with a network partition can be lost if the partition prevents the commit message (part of the two phased commit of cluster state updates) from reaching any single node in the majority side *and* the partition does allow for the master to acknowledge the change. We are working on a more comprehensive fix but that requires considerate work and is targeted at 6.0.
The only repository we can be sure is safe to clean is `fs` so we clean
any snapshots in those repositories after each test. Other repositories
like url and azure tend to throw exceptions rather than let us fetch
their contents during the REST test. So we clean what we can....
Closes#18159
LongGCDisruption simulates a Long GC by suspending all threads belonging to a node. That's fine, unless those threads hold shared locks that can prevent other nodes from running. Concretely the logging infrastructure, which is shared between the nodes, can cause some deadlocks. LongGCDisruption has protection for this, but it needs to be updated to point at log4j2 classes, introduced in #20235
This commit also fixes improper handling of retry logic in LongGCDisruption and adds a protection against deadlocking the test code which activates the disruption (and uses logging too! :)).
On top of that we have some new, evil and nasty tests.
`TransportService#registerRequestHandler` allowed to register
handlers more than once and issues an annoying warn log message when
this happens. This change simple throws an exception to prevent regsitering
the same handler more than once. This commit also removes the ability
to remove request handlers.
Relates to #20468
After this change SearchModule doesn't subclass AbstractModule anymore and all wiring
happens in `Node.java`. As a side-effect several tests don't need a guice injector anymore.
This commit modifies the logger names within Elasticsearch to be the
fully-qualified class name as opposed removing the org.elasticsearch
prefix and dropping the class name. This change separates the root
logger from the Elasticsearch loggers (they were equated from the
removal of the org.elasticsearch prefix) and enables log levels to be
set at the class level (instead of the package level).
Relates #20457
Today we add a prefix when logging within Elasticsearch. This prefix
contains the node name, and index and shard-level components if
appropriate.
Due to some implementation details with Log4j 2 , this does not work for
integration tests; instead what we see is the node name for the last
node to startup. The implementation detail here is that Log4j 2 there is
only one logger for a name, message factory pair, and the key derived
from the message factory is the class name of the message factory. So,
when the last node starts up and starts setting prefixes on its message
factories, it will impact the loggers for the other nodes.
Additionally, the prefixes are lost when logging an exception. This is
due to another implementation detail in Log4j 2. Namely, since we log
exceptions using a parameterized message, Log4j 2 decides that that
means that we do not want to use the message factory that we have
provided (the prefix message factory) and so logs the exception without
the prefix.
This commit fixes both of these issues.
Relates #20429
This commit cuts over geo_point fields to use Lucene's new point-based LatLonPoint type for indexes created in 5.0. Indexes created prior to 5.0 continue to use their respective encoding type. Below is a description of the changes made to support the new encoding type:
* New indexes use a new LatLonPointFieldMapper which provides a parse method for the new type
* The new LatLonPoint parse method removes support for lat_lon and geohash parameters
* Backcompat testing for deprecated lat_lon and geohash parameters is added to all unit and integration tests
* LatLonPointFieldMapper provides DocValues support (enabled by default) which uses Lucene's new LatLonDocValuesField type
* New LatLonPoint field data classes are added for aggregation support (wraps LatLonPoint's Numeric Doc Values)
* MultiFields use the geohash as the string value instead of the lat,lon string making it easier to perform geo string queries on the geohash instead of a lat,lon comma delimited string.
Removed Features:
* With the removal of geohash indexing, GeoHashCellQuery support is removed for all new indexes (still supported on existing indexes)
* LatLonPoint does not support a Distance Range query because it is super inefficient. Instead, the geo_distance_range query should be accomplished using either the geo_distance aggregation, sorting by descending distance on a geo_distance query, or a boolean must not of the excluded distance (which is what the distance_range query did anyway).
TODO:
* fix/finish yaml changes for plugin and rest integration tests
* update documentation
This commit adds a -q/--quiet option to Elasticsearch so that it does not log anything in the console and closes stdout & stderr streams. This is useful for SystemD to avoid duplicate logs in both journalctl and /var/log/elasticsearch/elasticsearch.log while still allows the JVM to print error messages in stdout/stderr if needed.
closes#17220
In 5.x we allowed this with a deprecation warning. This removes the code
added for that deprecation, requiring the cluster name to not be in the
data path.
Resolves#20391
This change removes the guice dependency handling for SearchService and
several related classes like SearchTransportController and SearchPhaseController.
The latter two now have package private constructors and dependencies like FetchPhase
are now created by calling their constructors explicitly. This also cleans up several users
of the DefaultSearchContext and centralized it's creation inside SearchService.
Splits the PrimaryShardAllocator and ReplicaShardAllocator's decision
making for a shard from the implementation of that decision on the
routing table. This is a step toward making it easier to use the same
logic for the cluster allocation explain APIs.
Introduce a base class for unit tests that are based on real `IndexShard`s. The base class takes care of all the little details needed to create and recover shards.
This commit also moves `IndexShardTests` and `ESIndexLevelReplicationTestCase` to use the new base class. All tests in `IndexShardTests` that required a full node environment were moved to a new `IndexShardIT` suite.
Before, when there was a new cluster state to publish,
zen discovery would first update the set of nodes to
ping based on the new cluster state, then publish the new
cluster state. This is problematic because if the cluster
state failed to publish, then the set of nodes to ping
should not have been updated.
This commit fixes the issue by updating the set of
nodes to ping for fault detection only *after* the new
cluster state has been published.
Search section supports an ext section that is used to provide additional config needed from plugins. It is now tied to sub fetch phases because it is the only section that may need additional config, but there is no reason for the two to be tightly coupled.
It is now possible to register a searchExtParser independently from a sub fetch phase. All a search ext parser does is parsing some ext section of a search request, whose parsed resulting object is stored in the search context for later retrieval.
The context was an object where the parsed info are stored. That is more of what we call the builder since after the search refactoring. No need for generics in FetchSubPhaseParser then. Also the previous setHitsExecutionNeeded wasn't useful, it can be removed as well, given that once there is a parsed ext section, it will become a builder that can be retrieved by the sub fetch phase. The sub fetch phase is responsible for doing nothing in case the builder is not set, meaning that the fetch sub phase is plugged in but the request didn't have the corresponding section.
SearchParseElement is renamed to FetchSubPhaseParser and moved to the search.fetch package. Its parse method doesn't get the SearchContext as argument anymore, only the XContentParser, and the return type is what gets parsed (the fetch sub phase context which we may as well rename later).
It is the parser that initializes the FetchSubPhaseContext then. SearchService retrieves the parser by name, calls parse against it and stores the result of parsing by name. No need for FetchSubPhase.ContextFactory anymore, which can be removed.
Given that doc value fields is our own fetch sub phase, it doesn't need to be implemented like if it was plugged in from the outside. It doesn't need its own fetch sub phase context, but it can just be an instance member in SearchContext
Previously we would disable console logging in certain circumstances
(for example, if Elasticsearch is not in the foreground, or if
Elasticsearch is in the foreground but an exception was thrown during
bootstrap). This commit makes this handling work with Log4j 2. This will
prevent users from seeing double bootstrap check failure messages.
Relates #20387
By default, when an exception causes the JVM to terminate, the stack
trace is printed. In the case of failing bootstrap checks, this stack
trace is useless to the user, and might even distract them from seeing
that the bootstrap checks failed for reasons under their control. With
this commit, we cause the stack trace for a failing bootstrap check to
be truncated.
We also modify some methods to not declare that they throw the top level
checked exception type Exception, but instead explicitly declare the
exceptions that they throw. These exceptions are caught and wrapped in a
BootstrapException so that we can percolate only two exception types out
of Bootstrap#init as checked exception, BootstrapException and
NodeValidationException.
Relates #19989
This commit cleans most of the methods of XContentBuilder so that:
- Jackson's convenience methods are used instead of our custom ones (ie field(String,long) now uses Jackson's writeNumberField(String, long) instead of calling writeField(String) then writeNumber(long))
- null checks are added for all field names and values
- methods are grouped by type in the class source
- methods have the same parameters names
- duplicated methods like field(String, String...) and array(String, String...) are removed
- varargs methods now have the "array" name to reflect that it builds arrays
- unused methods like field(String,BigDecimal) are removed
- all methods now follow the execution path: field(String,?) -> field(String) then value(?), and value(?) -> writeSomething() method. Methods to build arrays also follow the same execution path.
Exposing lucene 6.x minhash tokenfilter
Generate min hash tokens from an incoming stream of tokens that can
be used to estimate document similarity.
Closes#20149
Jython shades `jansi` into it's classpath without changing it's package or
anything like that. This causes attempts to load native code on windows which
blows up tests. This change adds `log4j.skipJansi=true` system property to our
tests as well as to the JVM properties we set.
The BackgroundIndexer now uses auto-generated IDs randomly. This causes some problems
for tests that still rely on the fact that the IDs are increasing integers. This change
exposes all IDs via a Set<String> to iterate over for tests.
This commit configures test logging for Log4j 2. The default logger
configuration uses the console appender but at the error level, so most
tests are missing logging. Instead, this commit provides a configuration
for tests which is picked up from the classpath by Log4j 2 when it
initializes. However, this now means that we can no longer initialize
Log4j with a bare-bones configuration when tests run as doing so will
prevent Log4j 2 from attempting to configure logging via the
classpath. Consequently, we move this needed initialization (as
commented, to avoid a message about a status logger not being configured
when we are preparing to configure Log4j from properties files in the
config directory) to only run when we are explicitly configuring Log4j
from properties files.
Relates #20284
If elasticsearch controls the ID values as well as the documents
version we can optimize the code that adds / appends the documents
to the index. Essentially we an skip the version lookup for all
documents unless the same document is delivered more than once.
On the lucene level we can simply call IndexWriter#addDocument instead
of #updateDocument but on the Engine level we need to ensure that we deoptimize
the case once we see the same document more than once.
This is done as follows:
1. Mark every request with a timestamp. This is done once on the first node that
receives a request and is fixed for this request. This can be even the
machine local time (see why later). The important part is that retry
requests will have the same value as the original one.
2. In the engine we make sure we keep the highest seen time stamp of "retry" requests.
This is updated while the retry request has its doc id lock. Call this `maxUnsafeAutoIdTimestamp`
3. When the engine runs an "optimized" request comes, it compares it's timestamp with the
current `maxUnsafeAutoIdTimestamp` (but doesn't update it). If the the request
timestamp is higher it is safe to execute it as optimized (no retry request with the same
timestamp has been run before). If not we fall back to "non-optimzed" mode and run the request as a retry one
and update the `maxUnsafeAutoIdTimestamp` unless it's been updated already to a higher value
Relates to #19813
* master:
Avoid NPE in LoggingListener
Randomly use Netty 3 plugin in some tests
Skip smoke test client on JDK 9
Revert "Don't allow XContentBuilder#writeValue(TimeValue)"
[docs] Remove coming in 2.0.0
Don't allow XContentBuilder#writeValue(TimeValue)
[doc] Remove leftover from CONSOLE conversion
Parameter improvements to Cluster Health API wait for shards (#20223)
Add 2.4.0 to packaging tests list
Docs: clarify scale is applied at origin+offest (#20242)
* Params improvements to Cluster Health API wait for shards
Previously, the cluster health API used a strictly numeric value
for `wait_for_active_shards`. However, with the introduction of
ActiveShardCount and the removal of write consistency level for
replication operations, `wait_for_active_shards` is used for
write operations to represent values for ActiveShardCount. This
commit moves the cluster health API's usage of `wait_for_active_shards`
to be consistent with its usage in the write operation APIs.
This commit also changes `wait_for_relocating_shards` from a
numeric value to a simple boolean value `wait_for_no_relocating_shards`
to set whether the cluster health operation should wait for
all relocating shards to complete relocation.
* Addresses code review comments
* Don't be lenient if `wait_for_relocating_shards` is set
* master:
Increase visibility of deprecation logger
Skip transport client plugin installed on JDK 9
Explicitly disable Netty key set replacement
percolator: Fail indexing percolator queries containing either a has_child or has_parent query.
Make it possible for Ingest Processors to access AnalysisRegistry
Allow RestClient to send array-based headers
Silence rest util tests until the bogusness can be simplified
Remove unknown HttpContext-based test as it fails unpredictably on different JVMs
Tests: Improve rest suite names and generated test names for docs tests
Add support for a RestClient base path
This commit adds an empty test to ESLoggerUsageTests to avoid the test
suite from failing for having no tests after the existing tests were
marked as awaits fix in 1d197eddcc.
This commit modifies the call sites that allocate a parameterized
message to use a supplier so that allocations are avoided unless the log
level is fine enough to emit the corresponding log message.
Rest test suites are currently only the directory above the yaml test
file. That is confusing when there are more than one directory level
which contain yaml tests, as there are in generated docs tests. This
change makes rest tests use the full relative path to the rest test root
as the suite name, and also makes the test names for docs tests a little
clearer (that they are testing an example from a specific line number,
instead of just the line number as an opaque test name).
Removed null check for token, if we are outside the null it already means it is null.
Fixed typo in comment and remove leftover assignment to unused local variable.
This test is periodically failing. As I suspect that the GCDisruption scheme is somehow making the wrong node block on
its cluster state update thread, I've added some more logging and a thread dump once the given assertion triggers
again.
Adds an explicit recoverySource field to ShardRouting that characterizes the type of recovery to perform:
- fresh empty shard copy
- existing local shard copy
- recover from peer (primary)
- recover from snapshot
- recover from other local shards on same node (shrink index action)