This is related to #29500 and #28898. This commit removes the abilitiy
to disable http pipelining. After this commit, any elasticsearch node
will support pipelined requests from a client. Additionally, it extracts
some of the http pipelining work to the server module. This extracted
work is used to implement pipelining for the nio plugin.
=== Char Group Tokenizer
The `char_group` tokenizer breaks text into terms whenever it encounters
a
character which is in a defined set. It is mostly useful for cases where
a simple
custom tokenization is desired, and the overhead of use of the
<<analysis-pattern-tokenizer, `pattern` tokenizer>>
is not acceptable.
=== Configuration
The `char_group` tokenizer accepts one parameter:
`tokenize_on_chars`::
A string containing a list of characters to tokenize the string on.
Whenever a character
from this list is encountered, a new token is started. Also supports
escaped values like `\\n` and `\\f`,
and in addition `\\s` to represent whitespace, `\\d` to represent
digits and `\\w` to represent letters.
Defaults to an empty list.
=== Example output
```The 2 QUICK Brown-Foxes jumped over the lazy dog's bone for $2```
When the configuration `\\s-:<>` is used for `tokenize_on_chars`, the
above sentence would produce the following terms:
```[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone,
for, $2 ]```
This commit fixes docs failure on language analyzers when compared to the built in analyzers.
The `elision` filters used by the rebuilt language analyzers should be case insensitive to match
the definition of the prebuilt analyzers.
Closes#30557
This commit adds Delete Repository, the associated docs and tests for
the high level REST API client. It also cleans up a seemingly innocuous
line in the RestDeleteRepositoryAction and some naming in SnapshotIT.
Relates #27205
The getDate() and getDates() existed prior to 5.x on long fields in
scripting. In 5.x, a new Date type for ScriptDocValues was added. The
getDate() and getDates() methods were left on long fields and added to date
fields to ease the transition. This commit removes those methods for
7.0.
The docs/README shows how to run the :docs:check goal on a subset of pages,
however using the current example fails. Removing the masking character before
the first wildcard fixes the problem.
Meta plugins existed only for a short time, in order to enable breaking
up x-pack into multiple plugins. However, now that x-pack is no longer
installed as a plugin, the need for them has disappeared. This commit
removes the meta plugins infrastructure.
This pipeline aggregation gives the user the ability to script functions that "move" across a window
of data, instead of single data points. It is the scripted version of MovingAvg pipeline agg.
Through custom script contexts, we expose a number of convenience methods:
- MovingFunctions.max()
- MovingFunctions.min()
- MovingFunctions.sum()
- MovingFunctions.unweightedAvg()
- MovingFunctions.linearWeightedAvg()
- MovingFunctions.ewma()
- MovingFunctions.holt()
- MovingFunctions.holtWinters()
- MovingFunctions.stdDev()
The user can also define any arbitrary logic via their own scripting, or combine with the above methods.
This change adds a `listTasks` method to the high level java
ClusterClient which allows listing running tasks through the
task management API.
Related to #27205
This commit adds Create Repository, the associated docs and tests
for the high level REST API client. A few small changes to the
PutRepository Request and Response went into the commit as well.
This commit exposes the master version to the REST test context. This
will be needed in a follow-up where the master version will be used to
determine whether or not a certain warning header is expected.
This does away with the deprecated `com.google.api-client:google-api-client:1.23`
and replaces it with `com.google.cloud:google-cloud-storage:1.28.0`.
It also changes security permissions for the repository-gcs plugin.
We are starting over on the changelog with a different approach. This
commit removes the existing incarnation of the changelog to remove
confusion that we need to continue adding entries to it.
The geo_bounding_box query might produce false positives alongside
the right and upper edges and false negatives alongside left and
bottom edges. This commit documents the behavior and defines the
maximum error.
Closes#29196
This commit changes the default out-of-the-box configuration for the
number of shards from five to one. We think this will help address a
common problem of oversharding. For users with time-based indices that
need a different default, this can be managed with index templates. For
users with non-time-based indices that find they need to re-shard with
the split API in place they no longer need to resort only to
reindexing.
Since this has the impact of changing the default number of shards used
in REST tests, we want to ensure that we still have coverage for issues
that could arise from multiple shards. As such, we randomize (rarely)
the default number of shards in REST tests to two. This is managed via a
global index template. However, some tests check the templates that are
in the cluster state during the test. Since this template is randomly
there, we need a way for tests to skip adding the template used to set
the number of shards to two. For this we add the default_shards feature
skip. To avoid having to write our docs in a complicated way because
sometimes they might be behind one shard, and sometimes they might be
behind two shards we apply the default_shards feature skip to all docs
tests. That is, these tests will always run with the default number of
shards (one).
The High Level REST Client's documentation suggested that users should
use the Low Level REST Client for index management activities. This
change removes that suggestion because the high level REST client
supports those APIs now.
This also changes the examples in the migration docs to that still use
the Low Level REST Client to use the non-deprecated varieats of
`performRequest`.
We want copying settings to be the default behavior. This commit
deprecates not copying settings, and disallows explicitly not copying
settings. This gives users a transition path to the future default
behavior.
Dates internally contain milliseconds (which appear when converting them
to Strings) however parsing does not accept them (and is being strict).
The parser has been changed so that Date is mandatory but the time
(including its fractions such as millis) are optional.
Fix#30002
We have a pile of documentation describing how to rebuild the built in
language analyzers and, previously, our documentation testing framework
made sure that the examples successfully built *an* analyzer but they
didn't assert that the analyzer built by the documentation matches the
built in anlayzer. Unsuprisingly, some of the examples aren't quite
right.
This adds a mechanism that tests that the analyzers built by the docs.
The mechanism is fairly simple and brutal but it seems to be working:
build a hundred random unicode sequences and send them through the
`_analyze` API with the rebuilt analyzer and then again through the
built in analyzer. Then make sure both APIs return the same results.
Each of these calls to `_anlayze` takes about 20ms on my laptop which
seems fine.
This commit adds the Snapshot Client with a first API call within it,
the get repositories call in snapshot/restore module. This also creates
a snapshot namespace for the docs, as well as get repositories docs.
Relates #27205
Today we can execute cluster API actions on only master, data or ingest nodes
using the `master:true`, `data:true` and `ingest:true` filters, but it is not
so easy to select coordinating-only nodes (i.e. those nodes that are neither
master nor data nor ingest nodes). This change fixes this by adding support for
a `coordinating_only` filter such that `coordinating_only:true` adds all
coordinating-only nodes to the set of selected nodes, and
`coordinating_only:false` deletes them.
Resolves#28831.
The HTTPClient used in watcher is based on the apache http client. The
current client is using a lot of defaults - which are not always
optimal. Two of those defaults are the maximum number of total
connections and the maximum number of connections to a single route.
If one of those limits is reached, the HTTPClient waits for a connection
to be finished thus acting in a blocking fashion. In order to prevent
this when many requests are being executed, we increase the limit of
total connections as well as the connections per route (a route is
basically an endpoint, which also contains proxy information, not
containing an URL, just hosts).
On top of that an additional option has been set to evict
long running connections, which can potentially be reused after some
time. As this requires an additional background thread, this required
some changes to ensure that the httpclient is closed properly. Also the
timeout for this can be configured.
Deprecate the many arguments versions of `performRequest` and
`performRequestAsync` in favor of the `Request` object flavored variants
introduced in #29623. We'll be dropping the many arguments variants in
7.0 because they make it difficult to add new features in a backwards
compatible way and they create a *ton* of intellisense noise.
We had been using `task_id:1` or `taskId:1` because it is parses as a
valid task identifier but the `:1` part is confusing. This replaces
those examples with `task_id` which matches the response from the list
tasks API.
Closes#28314
Fixes and edge case when using `more_like_this` where TermVectorsWriter
could throw an NPE when a field produced zero tokens after analysis. This
changes the implementation to use an empty list of tokens in this case.
Closes#30148
Auto-expands replicas in the same cluster state update (instead of a follow-up reroute) where nodes are added or removed.
Closes#1873, fixing an issue where nodes drop their copy of auto-expanded data when coming up, only to sync it again later.
Adds verification that geohashes are not empty and contain only
valid characters. It fixes the issue when en empty geohash is
treated as [-180, -90] and geohashes with non-geohash character
are getting resolved into invalid coordinates.
Closes#23579
When deleting or creating a snapshot for a given shard, elasticsearch
usually starts by listing all the existing snapshotted files in the repository.
Then it computes a diff and deletes the snapshotted files that are not
needed anymore. During this deletion, an exception is thrown if the file
to be deleted does not exist anymore.
This behavior is challenging with cloud based repository implementations
like S3 where a file that has been deleted can still appear in the bucket for
few seconds/minutes (because the deletion can take some time to be fully
replicated on S3). If the deleted file appears in the listing of files, then the
following deletion will fail with a NoSuchFileException and the snapshot
will be partially created/deleted.
This pull request makes the deletion of these files a bit less strict, ie not
failing if the file we want to delete does not exist anymore. It introduces a
new BlobContainer.deleteIgnoringIfNotExists() method that can be used
at some specific places where not failing when deleting a file is
considered harmless.
Closes#28322
Today when processing a request for a URL path for which we can not find
a handler we send back a plain-text response. Yet, we have the accept
header in our hand and can respect the accepted media type of the
request. This commit addresses this.
This change adds a new plugin called `analysis-nori` that exposes
Korean text analysis in es using the new Lucene Korean analyzer module named (`nori`).
The plugin adds:
* a Korean analyzer: `nori`
* a Korean tokenizer: `nori_tokenizer`
* a part of speech stop filter: `nori_part_of_speech`
* a filter that can replace Hanja characters with their Hangul transcription: `nori_readingform`
When validating the search request, we make sure any date_histogram
aggregations have timezones that match the jobs. But we didn't
do any such validation on range queries.
While it wouldn't produce incorrect results, it would be confusing
to the user as no documents would match the aggregation (because we
add a filter clause on the timezone for the agg).
Now the user gets an exception up front, and some helpful text about
why the range query didnt match, and which timezones are acceptable
This PR adds support for the Get Settings API to the java high-level rest client.
Furthermore, logic related to the retrieval of default settings has been moved from the rest layer into the transport layer and now default settings may be retrieved consistency via both the rest API and the transport API.
Upgrade to lucene-7.4.0-snapshot-1ed95c097b
This version contains:
* An Analyzer for Korean
* An IntervalQuery and IntervalsSource that retrieve minimum intervals of positional queries.
* A new API to retrieve matches (offsets and positions) of a query for a single document.
* Support for soft deletes in the index writer.
* A fixed shingle filter that handles index time synonyms.
* Support for emoji sequence in ICUTokenizer (with an upgrade to icu 61.1)
The IndexAndAliasesResolver resolves the indices and aliases for each
request and also handles local and remote indices. The current
implementation uses the ResolvedIndices class to hold the resolved
indices and aliases. While evaluating the indices and aliases against
the user's permissions, the final value for ResolvedIndices is
constructed. Prior to this change, this was done by creating a
ResolvedIndices for the first set of indices and for each additional
addition, a new ResolvedIndices object is created and merged with
the existing one. With a small number of indices and aliases this does
not pose a large problem; however as the number of indices/aliases
grows more list allocations and array copies are needed resulting in a
large amount of garbage and severely impacted performance.
This change introduces a builder for ResolvedIndices that appends to
mutable lists until the final value has been constructed, which will
ultimately reduce the amount of garbage generated by this code.
This commit fixes an issue with the data diagnostics were
empty buckets are not reported even though they should. Once
a job is reopened, the diagnostics do not get initialized from
the current data counts (especially the latest record timestamp).
The result is that if the data that is sent have a time gap compared
to the previous ones, that gap is not accounted for in the empty bucket
count.
This commit fixes that by initializing the diagnostics with the current
data counts.
Closes#30080
The current implementation starts/stops watcher using an executor. This
can result in our of order operations.
This commit reduces those executor calls to an absolute minimum in order
to be able to do state changes within the cluster state listener method,
which runs in sequence.
When a state change occurs that forces the watcher service to pause
(like no watcher index, no master node, no local shards), the service is
now in a paused state.
Pausing is a super lightweight operation, which marks the
ExecutionService as paused and waits for the currently executing watches
to finish in the background via an executor. The same applies for
stopping, the potentially long running operation is outsourced in to an
executor, as waiting for executed watches is decoupled from the current
state.
The only other long running operation is starting, where watches need to
be loaded. This is also done via an executor, but has an additional
protection by checking the cluster state version it was started with. If
another cluster state version was trying to load the watches, then this
loading will not take effect.
This PR also cleans up some unused states, like the a simple boolean in
the HistoryStore/TriggeredWatchStore marking it as started or stopped,
as this can now be caught in the execution service.
Another advantage of this approach is the fact, that now only triggered
watches are not getting executed, while watches that are run via the
Execute Watch API will still be executed regardless if watcher is
stopped or not.
Lastly the TickerScheduleTriggerEngine thread now only starts on data nodes.
Fix NPE when CumulativeSum agg encounters null/empty bucket
If the cusum agg encounters a null value, it's because the value is
missing (like the first value from a derivative agg), the path is
not valid, or the bucket in the path was empty.
Previously cusum would just explode on the null, but this changes it
so we only increment the sum if the value is non-null and finite.
This is safe because even if the cusum encounters all null or empty
buckets, the cumulative sum is still zero (like how the sum agg returns
zero even if all the docs were missing values)
I went ahead and tweaked AggregatorTestCase to allow testing pipelines,
so that I could delete the IT test and reimplement it as AggTests.
Closes#27544
This commit removes the http.enabled setting. While all real nodes (started with bin/elasticsearch) will always have an http binding, there are many tests that rely on the quickness of not actually needing to bind to 2 ports. For this case, the MockHttpTransport.TestPlugin provides a dummy http transport implementation which is used by default in ESIntegTestCase.
closes#12792
Systemd overrides should happen through /etc/systemd/system, not
directly editing the service file. This commit removes marking the
service file as configuration for rpm and deb packages.
This adds a new `_ignored` meta field which indexes and stores fields that have
been ignored at index time because of the `ignore_malformed` option. It makes
malformed documents easier to identify by using `exists` or `term(s)` queries
on the `_ignored` field.
Closes#29494
Adds two new methods to `RestClient` that take a `Request` object. These
methods will allows us to add more per-request customizable options
without creating more and more and more overloads of the `performRequest`
and `performRequestAsync` methods. These new methods look like:
```
Response performRequest(Request request)
```
and
```
void performRequestAsync(Request request, ResponseListener responseListener)
```
This change doesn't add any actual features but enables adding things like
per request timeouts and per request node selectors. This change *does*
rework the `HighLevelRestClient` and its tests to use these new `Request`
objects and it does update the docs.
Since we disable allocation using persistent settings, we should be consistent and remove
the setting from the persistent storage. Otherwise an accidental restart will lead for shards
not being allocated.
Relates to #28757
Since we disable allocation using persistent settings, we should be consistent and remove
the setting from the persistent storage. Otherwise an accidental restart will leed for shards
not being allocated.
Relates to https://github.com/elastic/elasticsearch/pull/28757
Today when an index is created from shrinking or splitting an existing
index, the target index inherits almost none of the source index
settings. This is surprising and a hassle for operators managing such
indices. Given this is the default behavior, we can not simply change
it. Instead, we start by introducing the ability to copy settings. This
flag can be set on the REST API or on the transport layer and it has the
behavior that it copies all settings from the source except non-copyable
settings (a property of a setting introduced in this
change). Additionally, settings on the request will always override.
This change is the first step in our adventure:
- this flag is added here in 7.0.0 and immediately deprecated
- this flag will be backported to 6.4.0 and remain deprecated
- then, we will remove the ability to set this flag to false in 7.0.0
- finally, in 8.0.0 we will remove this flag and the only behavior will
be for settings to be copied
Currently, the only way to get the REST response for the `/_cluster/state`
call to return the `cluster_uuid` is to request the `metadata` metrics,
which is one of the most expensive response structures. However, external
monitoring agents will likely want the `cluster_uuid` to correlate the
response with other API responses whether or not they want cluster
metadata.
Add yet another warning about data loss to the introductory paragraph about the
unsafe commands. Also move this paragraph next to the details of the unsafe
commands, below the section on the `retry_failed` flag.
Be more specific about how to use the URI parameters and in-body flags.
Clarify statements about when rebalancing takes place (i.e. it respects
settings)
Resolves#16113.
Today when a resize operation is performed, we copy the analysis,
similarity, and sort settings from the source index. It is possible for
the resize request to include additional index settings including
analysis, similarity, and sort settings. We reject sort settings when
validating the request. However, we silently ignore analysis and
similarity settings on the request that are already set on the source
index. Since it is possible to change the analysis and similarity
settings on an existing index, this should be considered a bug and the
sort of leniency that we abhor. This commit addresses this bug by
allowing the request analysis/similarity settings to override the
existing analysis/similarity settings on the target.
A previous change modified the output of the thread pool info contained
in the nodes info API. This commit adds a note to the migration docs for
this change.
A NullPointerException is thrown when trying to create or delete
a snapshot in a repository that has been written to by an older
Elasticsearch after writing to it with a newer Elasticsearch version.
This is because the way snapshots are formatted in the repository
snapshots index file changed in #24477.
This commit changes the parsing of the repository index file so that
it now detects a corrupted index file and fails early the snapshot
operation.
closes#29052
We already had *some* documentation of the batch nature of `reindex` and
friends but it wasn't super obvious how it interacted with the
`failures` element in the response. This adds some more documentation
the `failures` element.
Clearing the cache indices can be done via GET and POST. As GET should
only support read only operations, this removes the support for using
GET for clearing the indices caches.
Adding some allowed abbreviated values for intervals in date histograms
as well as documenting the limitations of intervals larger than days.
Closes#23294
The documentation for settings index.routing.allocation.enable,
index.routing.rebalance.enable and index.gc_deletes was lost in
f123a53d72. This change reinstates it.
* Clarify documentation of scroll_id
The Scroll API may return the same scroll ID for multiple requests due to server side state. This is not clear from the current documentation.
* Further clarify scroll ID return behaviour
This metric previously existed for backwards compatibility reasons
although the suggest stats were folded into search stats. This metric
was deprecated in 6.3.0 and this commit removes them for 7.0.0.
This commit adds the distribution type to the startup scripts so that we
can discern from log output and the main response the type of the
distribution (deb/rpm/tar/zip).
This commit adds the distribution flavor (default versus oss) to the
build process which is passed through the startup scripts to
Elasticsearch. This change will be used to customize the message on
attempting to install/remove x-pack based on the distribution flavor.
This commit makes x-pack a module and adds it to the default
distrubtion. It also creates distributions for zip, tar, deb and rpm
which contain only oss code.
The suggest stats were folded into the search stats as part of the
indices stats API in 5.0.0. However, the suggest metric remained as a
synonym for the search metric for BWC reasons. This commit deprecates
usage of the suggest metric on the indices stats API.
Similarly, due to the changes to fold the suggest stats into the search
stats, requesting the suggest index metric on the indices metric on the
nodes stats API has produced an empty object as the response since
5.0.0. This commit deprecates this index metric on the indices metric on
the nodes stats API.
The name of the bulk thread pool was renamed to "write" with "bulk" as a
fallback name. This change was made in 6.x for BWC reasons yet in 7.0.0
we are removing this fallback. This commit removes this fallback for the
write thread pool.
This commit renames the bulk thread pool to the write thread pool. This
is to better reflect the fact that the underlying thread pool is used to
execute any document write request (single-document index/delete/update
requests, and bulk requests).
With this change, we add support for fallback settings
thread_pool.bulk.* which will be supported until 7.0.0.
We also add a system property so that the display name of the thread
pool remains as "bulk" if needed to avoid breaking users.
Added an api that allows to execute an arbitrary script and a result to be returned.
```
POST /_scripts/painless/_execute
{
"script": {
"source": "params.var1 / params.var2",
"params": {
"var1": 1,
"var2": 1
}
}
}
```
Relates to #27875
* Add a CHANGELOG file for 7.x release notes.
* update file to include 6.x
* remove confusing comment and small edit to section title
* moving CHANGELOG file under docs directory, as it pertains to release notes.
Now that single-document indexing requests are executed on the bulk
thread pool the index thread pool is no longer needed. This commit
removes this thread pool from Elasticsearch.
As part of adding support for new API to the high-level REST client,
we added support for the `flat_settings` parameter to some of our
request classes. We added documentation that such flag is only ever
read by the high-level REST client, but the truth is that it doesn't
do anything given that settings are always parsed back into a `Settings`
object, no matter whether they are returned in a flat format or not.
It was a mistake to add support for this flag in the context of the
high-level REST client, hence this commit removes it.
We want to remove the index thread pool as it is no longer needed since
single-document indexing requests are executed as bulk requests
now. Analyze requests are also executed on the index thread pool though
and they need a thread pool to execute on. The bulk thread does not seem
like the right thread pool, let us keep that thread pool conceptually
for bulk requests and free for bulk requests. None of the existing
thread pools make sense for analyze requests either. The generic thread
pool would be a terrible choice since it has an unbounded queue and that
is a bad idea for user-facing APIs. This commit introduces a small by
default (size=1, queue_size=16) thread pool for analyze requests.
CRUD: Parsing changes for UpdateRequest (#29293)
Use `ObjectParser` to parse `UpdateRequest` so we reject unknown fields
and drop support for the `_fields` parameter because it was deprecated
in 5.x.
Control max size and count of warning headers
Add a static persistent cluster level setting
"http.max_warning_header_count" to control the maximum number of
warning headers in client HTTP responses.
Defaults to unbounded.
Add a static persistent cluster level setting
"http.max_warning_header_size" to control the maximum total size of
warning headers in client HTTP responses.
Defaults to unbounded.
With every warning header that exceeds these limits,
a message will be logged in the main ES log,
and any more warning headers for this response will be
ignored.
Historically, the bootstrap checks used 2048 as the minimum limit for
the maximum number of threads. This limit was guided by the fact that
the number of processors was artificially capped at 32. This limit was
removed in 6.0.0 and the minimum limit was raised to 4096 to accommodate
this. However, the docs were not updated and this commit addresses that
miss.
This adds an `include_type_name` option to the `indices.create`,
`indices.get_mapping` and `indices.put_mapping` APIs, which defaults to `true`.
When set to `false`, then mappings will be returned directly in the body of
the `indices.get_mapping` API, without keying them by the type name, the
`indices.create` will expect mappings directly under the `mappings` key, and
the `indices.put_mapping` will use `_doc` as a type name and fail if a `type`
is provided explicitly.
Relates #15613
This change validates that the `_search` request does not have trailing
tokens after the main object and fails the request with a parsing exception otherwise.
Closes#28995
Some features have been deprecated since `6.0` like the `_parent` field or the
ability to have multiple types per index. This allows to remove quite some
code, which in-turn will hopefully make it easier to proceed with the removal
of types.
From 7.0 on, using `delimited_payload_filter` should throw an error.
It was deprecated in 6.2 in favour of `delimited_payload` (#26625).
Relates to #27704
Today we report thread pool info using a common object. This means that
we use a shared set of terminology that is not consistent with the
terminology used to the configure thread pools. This holds in particular
for the minimum and maximum number of threads in the thread pool where
we use the following terminology:
thread pool info | fixed | scaling
min core size
max max size
A previous change addressed this for the nodes info API. This commit
changes the display of thread pool info in the cat thread pool API too
to be dependent on the type of the thread pool so that we can align the
terminology in the output of thread pool info with the terminology used
to configure a thread pool.
This improves the way similarities are plugged in in order to:
- reject the classic similarity on 7.x indices and emit a deprecation
warning otherwise
- reject unkwown parameters on 7.x indices and emit a deprecation
warning otherwise
Even though this breaks the plugin API, I'd like to backport to 7.x so
that users can get deprecation warnings when they are doing something
that will become unsupported in the future.
Closes#23208Closes#29035
I am not sure why we have this leniency for HTTP max content length, it
has been there since the beginning
(5ac51ee93f) with no explanation of its
source. That said, our philosophy today is different than the philosophy
of the past where Elasticsearch would be quite lenient in its handling
of settings and today we aim for predictability for both users and
us. This commit removes leniency in the parsing of
http.max_content_length.
Today this part of the documentation just says that Geo queries are not 100%
accurate, but in fact we can be more precise about which kinds of queries see
which kinds of error. This commit clarifies this point.
At time of writing, GeoJSON did not enforce a specific ordering of vertices in
a polygon, but it now does. We occasionally get reports of Elasticsearch
rejecting apparently-valid GeoJSON because of badly oriented polygons, and it's
helpful to be able to point at this bit of the documentation when responding.
As follow up to #28245 , this PR removes the logic for selecting the
right start commit from the Engine constructor in favor of explicitly
trimming them in the Store, before the engine is opened. This makes the
constructor in engine follow standard Lucene semantics and use the last
commit.
Relates #28245
Relates #29156
#28245 has introduced the utility class`EngineDiskUtils` with a set of methods to prepare/change
translog and lucene commit points. That util class bundled everything that's needed to create and
empty shard, bootstrap a shard from a lucene index that was just restored etc.
In order to safely do these manipulations, the util methods acquired the IndexWriter's lock. That
would sometime fail due to concurrent shard store fetching or other short activities that require the
files not to be changed while they read from them.
Since there is no way to wait on the index writer lock, the `Store` class has other locks to make
sure that once we try to acquire the IW lock, it will succeed. To side step this waiting problem, this
PR folds `EngineDiskUtils` into `Store`. Sadly this comes with a price - the store class doesn't and
shouldn't know about the translog. As such the logic is slightly less tight and callers have to do the
translog manipulations on their own.
This change refactors the composite aggregation to add an execution mode that visits documents in the order of the values
present in the leading source of the composite definition. This mode does not need to visit all documents since it can early terminate
the collection when the leading source value is greater than the lowest value in the queue.
Instead of collecting the documents in the order of their doc_id, this mode uses the inverted lists (or the bkd tree for numerics) to collect documents
in the order of the values present in the leading source.
For instance the following aggregation:
```
"composite" : {
"sources" : [
{ "value1": { "terms" : { "field": "timestamp", "order": "asc" } } }
],
"size": 10
}
```
... can use the field `timestamp` to collect the documents with the 10 lowest values for the field instead of visiting all documents.
For composite aggregation with more than one source the execution can early terminate as soon as one of the 10 lowest values produces enough
composite buckets. For instance if visiting the first two lowest timestamp created 10 composite buckets we can early terminate the collection since it
is guaranteed that the third lowest timestamp cannot create a composite key that compares lower than the one already visited.
This mode can execute iff:
* The leading source in the composite definition uses an indexed field of type `date` (works also with `date_histogram` source), `integer`, `long` or `keyword`.
* The query is a match_all query or a range query over the field that is used as the leading source in the composite definition.
* The sort order of the leading source is the natural order (ascending since postings and numerics are sorted in ascending order only).
If these conditions are not met this aggregation visits each document like any other agg.
The rank_eval documentation was missing an explanation of the parameter
`k` that controls the number of top hits that are used in the ranking evaluation.
Closes#29205
Adds docs for `HighLevelRestClient#multiSearch`. Unlike the `multiGet`
docs these are much more sparse because multi-search doesn't support
setting many options on the `MultiSearchRequest` and instead just wraps
a list of `SearchRequest`s.
Closes#28389
This enhancement adds Z value support (source only) to geo_shape fields. If vertices are provided with a third dimension, the third dimension is ignored for indexing but returned as part of source. Like beofre, any values greater than the 3rd dimension are ignored.
closes#23747
This commit adds a note to the low-level REST client docs regarding the
possibility of being impacted by the JVM DNS cache policy under a
default security manager policy.
This commit removes some parameters deprecated in 6.x (or 5.x):
`use_dismax`, `split_on_whitespace`, `all_fields` and `lowercase_expanded_terms`.
Closes#25551
This commit adds a new setting `cluster.persistent_tasks.allocation.enable`
that can be used to enable or disable the allocation of persistent tasks.
The setting accepts the values `all` (default) or `none`. When set to
none, the persistent tasks that are created (or that must be reassigned)
won't be assigned to a node but will reside in the cluster state with
a no "executor node" and a reason describing why it is not assigned:
```
"assignment" : {
"executor_node" : null,
"explanation" : "persistent task [foo/bar] cannot be assigned [no
persistent task assignments are allowed due to cluster settings]"
}
```
This will reject mapping updates to the `_default_` mapping with 7.x indices
and still emit a deprecation warning with 6.x indices.
Relates #15613
Supersedes #28248
Update allocation awareness docs
Today, the docs imply that if multiple attributes are specified the the
whole combination of values is considered as a single entity when
performing allocation. In fact, each attribute is considered separately. This
change fixes this discrepancy.
It also replaces the use of the term "awareness zone" with "zone or domain", and
reformats some paragraphs to the right width.
Fixes#29105
This is a follow up to a previous change which set the error file path
for the package distributions. The observation here is that we always
set the working directory of Elasticsearch to the root of the
installation (i.e., Elasticsearch home). Therefore, we can specify the
error file path relative to this directory and default it to the logs
directory, similar to the package distributions.
This is a follow up to a previous change which set the heap dump path
for the package distributions. The observation here is that we always
set the working directory of Elasticsearch to to the root of
installation (i.e., Elasticsearch home). Therefore, we can specify the
heap dump path relative to this directory and default it to the data
directory, similar to the package distributions.
Adds support for triple quoted strings to the documentation test
generator. Kibana's CONSOLE tool has supported them for a year but we
were unable to use them in Elasticsearch's docs because the process that
converts example snippets into tests couldn't handle this. This change
adds code to convert them into standard JSON so we can pass them to
Elasticsearch.
I have seen this question a couple times already, most recently at
https://twitter.com/dimosr7/status/973872744965332993
I tried to keep the explanation as simple as I could, which is not always easy
as this is a matter of trade-offs.
Additionally:
* Included the existing update by query java api docs in java-api docs.
(for some reason it was never included, it needed some tweaking and
then it was good to go)
* moved delete-by-query / update-by-query code samples to java file so
that we can verify that these samples at least compile.
Closes#24203
The Java API documentation for index administration currenty is wrong because
the PutMappingRequestBuilder#setSource(Object... source) an
CreateIndexRequestBuilder#addMapping(String type, Object... source) methods
delegate to methods that check that the input arguments are valid key/value
pairs. This changes the docs so the java api code examples are included from
documentation integration tests so we detect compile and runtime issues earlier.
Closes#28131
By the time the master branch is released the deprecated url
parameters in the `/_cache/clear` API will have been deprecated
for a couple of minor releases. Since master will be the next
major release we are fine with removing these parameters.
Currently we have a fairly complicated logic in the engine constructor logic to deal with all the
various ways we want to mutate the lucene index and translog we're opening.
We can:
1) Create an empty index
2) Use the lucene but create a new translog
3) Use both
4) Force a new history uuid in all cases.
This leads complicated code flows which makes it harder and harder to make sure we cover all the
corner cases. This PR tries to take another approach. Constructing an InternalEngine always opens
things as they are and all needed modifications are done by static methods directly on the
directory, one at a time.
We today support a global `indexed_chars` processor parameter. But in some cases, users would like to set this limit depending on the document itself.
It used to be supported in mapper-attachments plugin by extracting the limit value from a meta field in the document sent to indexation process.
We add an option which reads this limit value from the document itself
by adding a setting named `indexed_chars_field`.
Which allows running:
```
PUT _ingest/pipeline/attachment
{
"description" : "Extract attachment information. Used to parse pdf and office files",
"processors" : [
{
"attachment" : {
"field" : "data",
"indexed_chars_field" : "size"
}
}
]
}
```
Then index either:
```
PUT index/doc/1?pipeline=attachment
{
"data": "BASE64"
}
```
Which will use the default value (or the one defined by `indexed_chars`)
Or
```
PUT index/doc/2?pipeline=attachment
{
"data": "BASE64",
"size": 1000
}
```
Closes#28942
* Add a REST integration test that documents date_range support
Add a test case that exercises date_range aggregations using the missing
option.
Addresses #17597
* Test cleanup and correction
Adding a document with a null date to exercise `missing` option, update
test name to something reasonable.
* Update documentation to explain how the "missing" parameter works for
date_range aggregations.
* Wrap lines at 80 chars in docs.
* Change format of test to YAML for readability.
The current docs on [Indices APIs: PUT Mapping](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html) suggests that a having number of different mapping types per index is still possible in elasticsearch versions > 6.0.0 although they have been [removed](https://www.elastic.co/guide/en/elasticsearch/reference/current/removal-of-types.html). The console code has already been updated accordingly but notes (2) and (3) on the console code still name the `user` mapping type.
This PR updates the list with notes after the console code, as well as the first sentence of the docs
to avoid confusion. Also, I have removed the second command from the console code as it no
longer holds any value if the docs are solely on the `_doc` mapping.
The docs state that `_gce_` is recommended but the code sample states
that `_gce:hostname_` is recommended. This aligns the code sample with
the documentation. Also replace `type` with `zen.hosts_provider` as
discovery.type was removed in #25080.
With this commit we reduce heap usage of the ingest-geoip plugin by
memory-mapping the database files. Previously, we have stored these
files gzip-compressed but this has resulted that data are loaded on the
heap.
Closes#28782
The original example resulted in a 400 error due to the example being `-` separated instead of the default `.` separation.
```
failed to parse date field [2001-01-01] with format [YYYY.MM.dd]
```
Adds a usage example of the JLH score used in significant terms aggregation.
All other methods to calculate significance score have such an example
Closes#28513
Increase the default limit of `index.highlight.max_analyzed_offset` to 1M instead of previous 10K.
Enhance an error message when offset increased to include field name, index name and doc_id.
Relates to https://github.com/elastic/kibana/issues/16764
* Clarifies how the query_string splits textual part to build a query
Whitespaces are not considered as operators anymore in 6x but the documentation is not clear about it.
This commit changes the example in the documentation and adds a note regarding whitespaces and operators.
Closes#28719