It adds notes about:
- how preference can help optimize cache usage
- the fact that too many replicas can hurt search performance due to lower
utilization of the filesystem cache
- how index sorting can improve _source compression
- how always putting fields in the same order in documents can improve _source
compression
* Add documentation for the new parent-join field
This commit adds the docs for the new parent-join field.
It explains how to define, index and query this new field.
Relates #20257
UnicodeSetFilter was only allowed in the icu_folding token filter.
It seems useful to expose this setting in icu_normalizer token filter
and char filter.
This PR extends the TranslogDeletionPolicy to allow keeping the translog files longer than what is needed for recovery from lucene. Specifically, we allow specifying the total size of the files and their maximum age (i.e., keep up to 512MB but no longer than 12 hours). This will allow making ops based recoveries more common.
Note that the default size and age still set to 0, maintaining current behavior. This is needed as the other components in the system are not yet ready for a longer translog retention. I will adapt those in follow up PRs.
Relates to #10708
This commit does two things:
1. Adds logging at the DEBUG level for when the index-N blob is
updated.
2. When attempting to delete a snapshot, if the snapshot was not found
in the repository data, an exception is now thrown instead of silently
ignoring the lack of presence of the snapshot in the repository data.
* Add a section named "relation" in the ParentJoinFieldMapper
This commit puts the parent/child definition in an inner section named "relation".
Mapping for the parent-join will look like this:
```
"join_field": {
"type": "join"
"relations":
"parent": "child"
}
}
```
to specify a `targetField`. This results in some interesting behavior that was missed in the review.
This processor sorts in-place, so there is a side-effect in both the original field and the target field.
Another bug was that the targetField was not being set if the list being sorted was fewer than two elements.
The new behavior works like this: If targetField and fieldName are not the same, we copy the list.
* Upgrade icu4j for the ICU analysis plugin to 59.1
Lucene upgraded to 59.1 so we should use the same.
Closes#21425
* Add breaking change for the icu upgrade
We use assertBusy in many places where the underlying code throw exceptions. Currently we need to wrap those exceptions in a RuntimeException which is ugly.
This commit adds a NamedXContentProvider interface that can
be implemented by plugins or modules using Java's SPI feature
in order to provide additional NamedXContent parsers to external
applications like the Java High Level Rest Client.
At index time Elasticsearch needs to look up the version associated with the
`_id` of the document that is being indexed, which is often the bottleneck for
indexing.
While reviewing the output of the `jfr` telemetry from a Rally benchmark, I saw
that significant time was spent in `ConcurrentHashMap#get` and `ThreadLocal#get`.
The reason is that we cache lookup objects per thread and segment, and for every
indexed document, we first need to look up the cache associated with this
segment (`ConcurrentHashMap#get`) and then get a state that is local to the
current thread (`ThreadLocal#get`). So if you are indexing N documents per
second and have S segments, both these methods will be called N*S times per
second.
This commit changes version lookup to use a cache per index reader rather than
per segment. While this makes cache entries live for less long, we now only need
to do one call to `ConcurrentHashMap#get` and `ThreadLocal#get` per indexed
document.
This snapshot has faster range queries on range fields (LUCENE-7828), more
accurate norms (LUCENE-7730) and the ability to use fake term frequencies
(LUCENE-7854).
This commit adds a gradle project, set inside the root build.gradle,
which controls all our bwc tests. This allows for seamless (ie no errant
CI failures) backporting of behavior.
This commit renames the needsScores method so as to make it
automatically generatable, based on the name of the `_score` variable
which is available in search scripts. It also adds documentation to
ScriptContext to explain the naming and signature of such methods.
This commit removes the global caching of the field query and replaces it with
a caching per field. Each field can use a different `highlight_query` and the rewriting of
some queries (prefix, automaton, ...) depends on the targeted field so the query used for highlighting
must be unique per field.
There might be a small performance penalty when highlighting multiple fields since the query needs to be rewritten
once per highlighted field with this change.
Fixes#25171
* Remove QUERY_AND_FETCH BWC for pre-5.3.0 nodes
This was a BWC layer where we expicitly set the `search_type` to
"query_and_fetch" when a single node is queried on pre-5.3 nodes. Since 6.0 no
longer needs to be compatible with 5.3 nodes, this can be removed.
* Fix indentation
* Remove unused QUERY_FETCH_ACTION_NAME constant
* Add more missing AggregationBuilder getters
- getMetadata for all aggs
- various getters on TermsAggBuilder (without "get" prefix to maintain convention)
- Also makes InternalSum's ctor public, to follow suit of other metrics (min/max/avg/etc)
We introduced a new API for ranges in order to be able to decide whether points
or doc values would be more appropriate to execute a query, but since
`ProfileWeight` does not implement this API, the optimization is disabled when
profiling is enabled.
In #25201, a setting was added to allow setting the retry timeout for the rest client under the
impression that this would allow requests to go longer than 30s. However, there is also a socket
timeout that needs to be set to greater than 30s, which this change adds a setting for.
Those plugins don't replace the discovery logic but rather only provide a custom unicast host provider for their respective platforms. in 5.1 we introduced the `discovery.zen.hosts_provider` setting to better reflect it. This PR removes BWC code in those plugins as it is not needed anymore
Fixes#24543
This prevents possible race conditions between the Elasticsearch JVM and
plugin native controller processes that can cause the Elasticsearch shutdown
to hang. The problem can happen when the JVM and the controller process
receive a SIGTERM at almost the same time.
(There's an assumption here that Elasticsearch will continue to use other
mechanisms to kill native controller processes.)
* Port support for commercial GeoIP2 databases from Logstash.
* Match GeoIP databases according to the database name suffix.
* Rename CITY/COUNTRY_DB_TYPE, since they are suffixes now.
Expose the experimental simplepattern and
simplepatternsplit tokenizers in the common
analysis plugin. They provide tokenization based
on regular expressions, using Lucene's
deterministic regex implementation that is usually
faster than Java's and has protections against
creating too-deep stacks during matching.
Both have a not-very-useful default pattern of the
empty string because all tokenizer factories must
be able to be instantiated at index creation time.
They should always be configured by the user
in practice.
This commit adds a setting to change the request timeout for the rest client. This is useful as the
default timeout is 30s, which is also the same default for calls like cluster health. If both are
the same then the response from the cluster health api will not be received as the client usually
times out first making test failures harder to debug.
Relates #25185
The secure repository-hdfs tests fail on JDK 9 because some Hadoop code
reaches into sun.security.krb5. This commit adds the necessary flags to
open the java.security.jgss module. Note that these flags are actually
needed at runtime as well when using secure repository-hdfs. For now we
will punt on how best to help users obtain this when running on JDK 9
with this plugin.
Relates #25205