As follow up to #28245 , this PR removes the logic for selecting the
right start commit from the Engine constructor in favor of explicitly
trimming them in the Store, before the engine is opened. This makes the
constructor in engine follow standard Lucene semantics and use the last
commit.
Relates #28245
Relates #29156
#28245 has introduced the utility class`EngineDiskUtils` with a set of methods to prepare/change
translog and lucene commit points. That util class bundled everything that's needed to create and
empty shard, bootstrap a shard from a lucene index that was just restored etc.
In order to safely do these manipulations, the util methods acquired the IndexWriter's lock. That
would sometime fail due to concurrent shard store fetching or other short activities that require the
files not to be changed while they read from them.
Since there is no way to wait on the index writer lock, the `Store` class has other locks to make
sure that once we try to acquire the IW lock, it will succeed. To side step this waiting problem, this
PR folds `EngineDiskUtils` into `Store`. Sadly this comes with a price - the store class doesn't and
shouldn't know about the translog. As such the logic is slightly less tight and callers have to do the
translog manipulations on their own.
This change refactors the composite aggregation to add an execution mode that visits documents in the order of the values
present in the leading source of the composite definition. This mode does not need to visit all documents since it can early terminate
the collection when the leading source value is greater than the lowest value in the queue.
Instead of collecting the documents in the order of their doc_id, this mode uses the inverted lists (or the bkd tree for numerics) to collect documents
in the order of the values present in the leading source.
For instance the following aggregation:
```
"composite" : {
"sources" : [
{ "value1": { "terms" : { "field": "timestamp", "order": "asc" } } }
],
"size": 10
}
```
... can use the field `timestamp` to collect the documents with the 10 lowest values for the field instead of visiting all documents.
For composite aggregation with more than one source the execution can early terminate as soon as one of the 10 lowest values produces enough
composite buckets. For instance if visiting the first two lowest timestamp created 10 composite buckets we can early terminate the collection since it
is guaranteed that the third lowest timestamp cannot create a composite key that compares lower than the one already visited.
This mode can execute iff:
* The leading source in the composite definition uses an indexed field of type `date` (works also with `date_histogram` source), `integer`, `long` or `keyword`.
* The query is a match_all query or a range query over the field that is used as the leading source in the composite definition.
* The sort order of the leading source is the natural order (ascending since postings and numerics are sorted in ascending order only).
If these conditions are not met this aggregation visits each document like any other agg.
The rank_eval documentation was missing an explanation of the parameter
`k` that controls the number of top hits that are used in the ranking evaluation.
Closes#29205
Adds docs for `HighLevelRestClient#multiSearch`. Unlike the `multiGet`
docs these are much more sparse because multi-search doesn't support
setting many options on the `MultiSearchRequest` and instead just wraps
a list of `SearchRequest`s.
Closes#28389
This enhancement adds Z value support (source only) to geo_shape fields. If vertices are provided with a third dimension, the third dimension is ignored for indexing but returned as part of source. Like beofre, any values greater than the 3rd dimension are ignored.
closes#23747
This commit adds a note to the low-level REST client docs regarding the
possibility of being impacted by the JVM DNS cache policy under a
default security manager policy.
This commit removes some parameters deprecated in 6.x (or 5.x):
`use_dismax`, `split_on_whitespace`, `all_fields` and `lowercase_expanded_terms`.
Closes#25551
This commit adds a new setting `cluster.persistent_tasks.allocation.enable`
that can be used to enable or disable the allocation of persistent tasks.
The setting accepts the values `all` (default) or `none`. When set to
none, the persistent tasks that are created (or that must be reassigned)
won't be assigned to a node but will reside in the cluster state with
a no "executor node" and a reason describing why it is not assigned:
```
"assignment" : {
"executor_node" : null,
"explanation" : "persistent task [foo/bar] cannot be assigned [no
persistent task assignments are allowed due to cluster settings]"
}
```
This will reject mapping updates to the `_default_` mapping with 7.x indices
and still emit a deprecation warning with 6.x indices.
Relates #15613
Supersedes #28248
Update allocation awareness docs
Today, the docs imply that if multiple attributes are specified the the
whole combination of values is considered as a single entity when
performing allocation. In fact, each attribute is considered separately. This
change fixes this discrepancy.
It also replaces the use of the term "awareness zone" with "zone or domain", and
reformats some paragraphs to the right width.
Fixes#29105
This is a follow up to a previous change which set the error file path
for the package distributions. The observation here is that we always
set the working directory of Elasticsearch to the root of the
installation (i.e., Elasticsearch home). Therefore, we can specify the
error file path relative to this directory and default it to the logs
directory, similar to the package distributions.
This is a follow up to a previous change which set the heap dump path
for the package distributions. The observation here is that we always
set the working directory of Elasticsearch to to the root of
installation (i.e., Elasticsearch home). Therefore, we can specify the
heap dump path relative to this directory and default it to the data
directory, similar to the package distributions.
Adds support for triple quoted strings to the documentation test
generator. Kibana's CONSOLE tool has supported them for a year but we
were unable to use them in Elasticsearch's docs because the process that
converts example snippets into tests couldn't handle this. This change
adds code to convert them into standard JSON so we can pass them to
Elasticsearch.
I have seen this question a couple times already, most recently at
https://twitter.com/dimosr7/status/973872744965332993
I tried to keep the explanation as simple as I could, which is not always easy
as this is a matter of trade-offs.
Additionally:
* Included the existing update by query java api docs in java-api docs.
(for some reason it was never included, it needed some tweaking and
then it was good to go)
* moved delete-by-query / update-by-query code samples to java file so
that we can verify that these samples at least compile.
Closes#24203
The Java API documentation for index administration currenty is wrong because
the PutMappingRequestBuilder#setSource(Object... source) an
CreateIndexRequestBuilder#addMapping(String type, Object... source) methods
delegate to methods that check that the input arguments are valid key/value
pairs. This changes the docs so the java api code examples are included from
documentation integration tests so we detect compile and runtime issues earlier.
Closes#28131
By the time the master branch is released the deprecated url
parameters in the `/_cache/clear` API will have been deprecated
for a couple of minor releases. Since master will be the next
major release we are fine with removing these parameters.
Currently we have a fairly complicated logic in the engine constructor logic to deal with all the
various ways we want to mutate the lucene index and translog we're opening.
We can:
1) Create an empty index
2) Use the lucene but create a new translog
3) Use both
4) Force a new history uuid in all cases.
This leads complicated code flows which makes it harder and harder to make sure we cover all the
corner cases. This PR tries to take another approach. Constructing an InternalEngine always opens
things as they are and all needed modifications are done by static methods directly on the
directory, one at a time.
We today support a global `indexed_chars` processor parameter. But in some cases, users would like to set this limit depending on the document itself.
It used to be supported in mapper-attachments plugin by extracting the limit value from a meta field in the document sent to indexation process.
We add an option which reads this limit value from the document itself
by adding a setting named `indexed_chars_field`.
Which allows running:
```
PUT _ingest/pipeline/attachment
{
"description" : "Extract attachment information. Used to parse pdf and office files",
"processors" : [
{
"attachment" : {
"field" : "data",
"indexed_chars_field" : "size"
}
}
]
}
```
Then index either:
```
PUT index/doc/1?pipeline=attachment
{
"data": "BASE64"
}
```
Which will use the default value (or the one defined by `indexed_chars`)
Or
```
PUT index/doc/2?pipeline=attachment
{
"data": "BASE64",
"size": 1000
}
```
Closes#28942
* Add a REST integration test that documents date_range support
Add a test case that exercises date_range aggregations using the missing
option.
Addresses #17597
* Test cleanup and correction
Adding a document with a null date to exercise `missing` option, update
test name to something reasonable.
* Update documentation to explain how the "missing" parameter works for
date_range aggregations.
* Wrap lines at 80 chars in docs.
* Change format of test to YAML for readability.
The current docs on [Indices APIs: PUT Mapping](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html) suggests that a having number of different mapping types per index is still possible in elasticsearch versions > 6.0.0 although they have been [removed](https://www.elastic.co/guide/en/elasticsearch/reference/current/removal-of-types.html). The console code has already been updated accordingly but notes (2) and (3) on the console code still name the `user` mapping type.
This PR updates the list with notes after the console code, as well as the first sentence of the docs
to avoid confusion. Also, I have removed the second command from the console code as it no
longer holds any value if the docs are solely on the `_doc` mapping.
The docs state that `_gce_` is recommended but the code sample states
that `_gce:hostname_` is recommended. This aligns the code sample with
the documentation. Also replace `type` with `zen.hosts_provider` as
discovery.type was removed in #25080.
With this commit we reduce heap usage of the ingest-geoip plugin by
memory-mapping the database files. Previously, we have stored these
files gzip-compressed but this has resulted that data are loaded on the
heap.
Closes#28782
The original example resulted in a 400 error due to the example being `-` separated instead of the default `.` separation.
```
failed to parse date field [2001-01-01] with format [YYYY.MM.dd]
```
Adds a usage example of the JLH score used in significant terms aggregation.
All other methods to calculate significance score have such an example
Closes#28513
Increase the default limit of `index.highlight.max_analyzed_offset` to 1M instead of previous 10K.
Enhance an error message when offset increased to include field name, index name and doc_id.
Relates to https://github.com/elastic/kibana/issues/16764
* Clarifies how the query_string splits textual part to build a query
Whitespaces are not considered as operators anymore in 6x but the documentation is not clear about it.
This commit changes the example in the documentation and adds a note regarding whitespaces and operators.
Closes#28719
Values for the network.host setting can often contain a colon which is a
character that is considered special by YAML (these arise in IPv6
addresses and some of the special tags like ":ipv4"). As such, these
values need to be quoted or a YAML parser will be unhappy with
them. This commit adds a note to the docs regarding this.
* Reject regex search if regex string is too long (#28344)
* Add docs
* Introduce index level setting `index.max_regex_length`
to control the maximum length of the regular expression
Closes#28344
Similarly to what has been done for s3 and azure, this commit removes
the repository settings `application_name` and `connect/read_timeout`
in favor of client settings. It introduce a GoogleCloudStorageClientSettings
class (similar to S3ClientSettings) and a bunch of unit tests for that,
it aligns the documentation to be more coherent with the S3 one, it
documents the connect/read timeouts that were not documented at all and
also adds a new client setting that allows to define a custom endpoint.
We previously specified the -server flag to force the JVM to use the
server JVM. This is the default on all the systems that we support when
using a 64-bit JVM (and we no longer support 32-bit JVMs). There was
some trouble with this flag for the Windows service since procrun did
not understand what to do with it; as such, we had to filter this flag
out in the service. When we migrated to parsing JVM options in Java (via
the JVM options parser) we simplified this situation and removed
specifying the -server flag. This commit removes a leftover statement
that we are forcing the server JVM.
Relates #28738
The node stats API enables filtlering the top-level stats for only
desired top-level stats. Yet, this was never enabled for adaptive
replica selection stats. This commit enables this. We also add setting
these stats on the request builder, and fix an inconsistent name in a
setter.
Relates #28721
The Windows service will use a private temporary directory under the
user that is performing the installation. In cases when the service will
run as a different user, operators need a method to set this temporary
directory elsewhere. We have such a mechanism, so this commit merely
adds a note to the documentation on how to utilize it.
Relates #28712
Elasticsearch 6.x indices do not allow multiple types index. Instead, they use "_doc" as default if created internally (Elasticsearch), or "doc" default if sent by Logstash.
Currently the Translog constructor is capable both of opening an existing translog and creating a
new one (deleting existing files). This PR separates these two into separate code paths. The
constructors opens files and a dedicated static methods creates an empty translog.