extract all clauses from a conjunction query.
When clauses from a conjunction are extracted the number of clauses is
also stored in an internal doc values field (minimum_should_match field).
This field is used by the CoveringQuery and allows the percolator to
reduce the number of false positives when selecting candidate matches and
in certain cases be absolutely sure that a conjunction candidate match
will match and then skip MemoryIndex validation. This can greatly improve
performance.
Before this change only a single clause was extracted from a conjunction
query. The percolator tried to extract the clauses that was rarest in order
(based on term length) to attempt less candidate queries to be selected
in the first place. However this still method there is still a very high
chance that candidate query matches are false positives.
This change also removes the influencing query extraction added via #26081
as this is no longer needed because now all conjunction clauses are extracted.
https://www.elastic.co/guide/en/elasticsearch/reference/6.x/percolator.html#_influencing_query_extractionCloses#26307
This commit adds a parent pipeline aggregation that allows
sorting the buckets of a parent multi-bucket aggregation.
The aggregation also offers [from] and [size] parameters
in order to truncate the result as desired.
Closes#14928
Sometimes systems like Beats would want to extract the date's timezone and/or locale
from a value in a field of the document. This PR adds support for mustache templating
to extract these values.
Closes#24024.
* Add limits for ngram and shingle settings (#27211)
Create index-level settings:
max_ngram_diff - maximum allowed difference between max_gram and min_gram in
NGramTokenFilter/NGramTokenizer. Default is 1.
max_shingle_diff - maximum allowed difference between max_shingle_size and
min_shingle_size in ShingleTokenFilter. Default is 3.
Throw an IllegalArgumentException when
trying to create NGramTokenFilter, NGramTokenizer, ShingleTokenFilter
where difference between max_size and min_size exceeds the settings value.
Closes#25887
Some code-paths use anonymous classes (such as NonCollectingAggregator
in terms agg), which messes up the display name of the profiler. If
we encounter an anonymous class, we need to grab the super's name.
Another naming issue was that ProfileAggs were not delegating to the
wrapped agg's name for toString(), leading to ugly display.
This PR also fixes up the profile documentation. Some of the examples were
executing against empty indices, which shows different profile results
than a populated index (and made for confusing examples).
Finally, I switched the agg display names from the fully qualified name
to the simple name, so that it's similar to how the query profiles work.
Closes#26405
This change adds a new `_split` API that allows to split indices into a new
index with a power of two more shards that the source index. This API works
alongside the `_shrink` API but doesn't require any shard relocation before
indices can be split.
The split operation is conceptually an inverse `_shrink` operation since we
initialize the index with a _syntetic_ number of routing shards that are used
for the consistent hashing at index time. Compared to indices created with
earlier versions this might produce slightly different shard distributions but
has no impact on the per-index backwards compatibility. For now, the user is
required to prepare an index to be splittable by setting the
`index.number_of_routing_shards` at index creation time. The setting allows the
user to prepare the index to be splittable in factors of
`index.number_of_routing_shards` ie. if the index is created with
`index.number_of_routing_shards: 16` and `index.number_of_shards: 2` it can be
split into `4, 8, 16` shards. This is an intermediate step until we can make
this the default. This also allows us to safely backport this change to 6.x.
The `_split` operation is implemented internally as a DeleteByQuery on the
lucene level that is executed while the primary shards execute their initial
recovery. Subsequent merges that are triggered due to this operation will not be
executed immediately. All merges will be deferred unti the shards are started
and will then be throttled accordingly.
This change is intended for the 6.1 feature release but will not support pre-6.1
indices to be split unless these indices have been shrunk before. In that case
these indices can be split backwards into their original number of shards.
This query returns documents that match with at least one ore more
of the provided terms. The number of terms that must match varies
per document and is either controlled by a minimum should match
field or computed per document in a minimum should match script.
Closes#26915
* Update Docker docs for 6.0.0-rc2
* Update the docs to match the new Docker "image flavours" of "basic",
"platinum", and "oss".
* Clarifications for Openshift and bind-mounts
* Bump docker-compose 2.x format to 2.2
* Combine Docker Toolbox instructions for setting vm.max_map_count for
both macOS + Windows
* devicemapper is not the default storage driver any more on RHEL
Introduce minimal thread scheduler as a base class for `ThreadPool`. Such a class can be used from the `BulkProcessor` to schedule retries and the flush task. This allows to remove the `ThreadPool` dependency from `BulkProcessor`, which requires to provide settings that contain `node.name` and also needed log4j for logging. Instead, it needs now a `Scheduler` that is much lighter and gets automatically created and shut down on close.
Closes#26028
Due to a change happened via #26102 to make the nested source consistent
with or without source filtering, the _source of a nested inner hit was
always wrapped in the parent path. This turned out to be not ideal for
users relying on the nested source, as it would require additional parsing
on the client side. This change fixes this, the _source of nested inner hits
is now no longer wrapped by parent json objects, irregardless of whether
the _source is included as is or source filtering is used.
Internally source filtering and highlighting relies on the fact that the
_source of nested inner hits are accessible by its full field path, so
in order to now break this, the conversion of the _source into its binary
form is performed in FetchSourceSubPhase, after any potential source filtering
is performed to make sure the structure of _source of the nested inner hit
is consistent irregardless if source filtering is performed.
PR for #26944Closes#26944
Today all these API calls have a sideeffect of making documents visible
to search requests. While this is sometimes desired it's an unnecessary sideeffect
and now that we have an internal (engine-private) index reader (#26972) we artificially
add a refresh call for bwc. This change removes this sideeffect in 7.0.
This commit adds a note to the docs on the full_id parameter in the cat
nodes API. This is a useful parameter but was not previously documented
anywhere.
Relates #27009
This commit reformats a paragraph in the template docs to fit in 80
columns as for the rest of the doc, and as-is a standard that we loosely
adhere to.
This commit clarifies the interaction between settings specified in a
create index request, and those that would come from any templates that
apply to the create index request.
Relates #26994
The shard preference _primary, _replica and its variants were useful
for the asynchronous replication. However, with the current impl, they
are no longer useful and should be removed.
Closes#26335
Add fuzzy_transpositions parameter to multi_match and query_string queries.
Add fuzzy_transpositions, fuzzy_prefix_length and fuzzy_max_expansions
parameters to simple_query_string query.
In 5.x pure wildcard queries `*` in `query_string` are rewritten to `exists` query for efficiency.
Though this introduced a change in the document that match such queries because
`exists` query also return documents with an empty value for the field.
This change clarifies this behavior for 5.x and beyond.
Closes#26801
* review
This change adds cgroup memory usage/limit to the OS stats section of
the node stats on Linux. This information is useful because in Docker
containers the standard node stats report the host memory limit, not
taking account of extra restrictions that may have been applied to the
container.
The original idea was to store these values as Long, truncating any values
outside the range of long. However, this meant that in the relatively common
case of no limit being applied, users would not see the same value in the OS
stats as they see by querying Linux directly. So instead the values are stored
as String. This change places a burden on consumers of the strings to
convert the strings to numbers and decide what to do about extremely large
values, but there will be very few consumers and they would need to have a
policy for dealing with "no limit" in any case.
Early termination with index sorting always return the best top N in the response but set the flag `terminated_early`
in the response. This can be confusing because we use the same flag for `terminate_after` which on the contrary returns partial results.
This change removes the flag when results are not partial (early termination due to index sorting) and keeps it only when `terminate_after` is used.
Closes#26408
Numeric fields no longer support the index_options parameter. This changes the parameter
to be rejected in numeric field types after it was deprecated in 6.0.
Closes#21475
Other tokenizers like the standard tokenizer allow overriding the default
maximum token length of 255 using the `"max_token_length` parameter. This change
enables using this parameter also with the whitespace tokenizer. The range that
is currently allowed is from 0 to StandardTokenizer.MAX_TOKEN_LENGTH_LIMIT,
which is 1024 * 1024 = 1048576 characters.
Closes#26643
The JVM defaults to dumping the heap to the working directory of
Elasticsearch. For the RPM and Debian packages, this location is
/usr/share/elasticsearch. This directory is not writable by the
elasticsearch user, so by default heap dumps in this situation are
lost. This commit modifies the packaging for the RPM and Debian packages
to set the heap dump path to /var/lib/elasticsearch as the default
location for dumping the heap. This location is writable by the
elasticsearch user by default. We add documentation of this important
setting if /var/lib/elasticsearch is not suitable for receiving heap
dumps.
Relates #26755
Adds the wait_for_active_shards parameter to the index open command. Similar to the index creation command, the index open command will now, by default, wait until the primaries have been allocated.
Closes#20937
The `fielddata` field and the use of the `_name` field in the short syntax of the range
query have been deprecated in 5.0 and can be removed.
The same goes for the deprecated `score_mode` field in HasParentQueryBuilder,
the deprecated `like_text`, `ids` and `docs` parameter in the `more_like_this` query,
the deprecated query name in the short version of the `regexp` query, and several
deprecated alternative field names in other query builders.
In this test, 260b is replaced by the regexp \d+b
but the test sometimes produces results like 1.1kb
so this commit adapts the regexp to match values
with decimals
The new ops based recovery, introduce as part of #10708, is based on the assumption that all operations below the global checkpoint known to the replica do not need to be synced with the primary. This is based on the guarantee that all ops below it are available on primary and they are equal. Under normal operations this guarantee holds. Sadly, it can be violated when a primary is restored from an old snapshot. At the point the restore primary can miss operations below the replica's global checkpoint, or even worse may have total different operations at the same spot. This PR introduces the notion of a history uuid to be able to capture the difference with the restored primary (in a follow up PR).
The History UUID is generated by a primary when it is first created and is synced to the replicas which are recovered via a file based recovery. The PR adds a requirement to ops based recovery to make sure that the history uuid of the source and the target are equal. Under normal operations, all shard copies will stay with that history uuid for the rest of the index lifetime and thus this is a noop. However, it gives us a place to guarantee we fall back to file base syncing in special events like a restore from snapshot (to be done as a follow up) and when someone calls the truncate translog command which can go wrong when combined with primary recovery (this is done in this PR).
We considered in the past to use the translog uuid for this function (i.e., sync it across copies) and thus avoid adding an extra identifier. This idea was rejected as it removes the ability to verify that a specific translog really belongs to a specific lucene index. We also feel that having a history uuid will serve us well in the future.
Removing several occurrences of this typo in the docs and javadocs, seems to be
a common mistake. Corrections turn up once in a while in PRs, better to correct
some of this in one sweep.
* Fix percolator highlight sub fetch phase to not highlight query twice
The PercolatorHighlightSubFetchPhase does not override hitExecute and since it extends HighlightPhase the search hits
are highlighted twice (by the highlight phase and then by the percolator). This does not alter the results, the second highlighting
just overrides the first one but this slow down the request because it duplicates the work.