Restoring from a snapshot (which is a particular form of recovery) does not currently take recovery throttling into account
(i.e. the `indices.recovery.max_bytes_per_sec` setting). While restores are subject to their own throttling (repository
setting `max_restore_bytes_per_sec`), this repository setting does not allow for values to be configured differently on a
per-node basis. As restores are very similar in nature to peer recoveries (streaming bytes to the node), it makes sense to
configure throttling in a single place.
The `max_restore_bytes_per_sec` setting is also changed to default to unlimited now, whereas previously it was set to
`40mb`, which is the current default of `indices.recovery.max_bytes_per_sec`). This means that no behavioral change
will be observed by clusters where the recovery and restore settings were not adapted.
Relates https://github.com/elastic/elasticsearch/issues/57023
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Today the disk-based shard allocator accounts for incoming shards by
subtracting the estimated size of the incoming shard from the free space on the
node. This is an overly conservative estimate if the incoming shard has almost
finished its recovery since in that case it is already consuming most of the
disk space it needs.
This change adds to the shard stats a measure of how much larger each store is
expected to grow, computed from the ongoing recovery, and uses this to account
for the disk usage of incoming shards more accurately.
Backport of #58029 to 7.x
* Picky picky
* Missing type
Adds an explicit check to `variable_width_histogram` to stop it from
trying to collect from many buckets because it can't. I tried to make it
do so but that is more than an afternoon's project, sadly. So for now we
just disallow it.
Relates to #42035
This commit adds conditional logic to the docs to avoid including any
docs on searchable snapshots in released versions.
Rework of #58556 which was reverted.
Adds an API for putting an index block in place, which also ensures for write blocks that, once successfully returning to
the user, all shards of the index are properly accounting for the block, for example that all in-flight writes to an index have
been completed after adding the write block.
This API allows coordinating more complex workflows, where it is crucial that an index is no longer receiving writes after
the API completes, useful for example when marking an index as read-only during an upgrade in order to reindex its
documents.
* Adding create index snapshot API page.
* Condense API description.
* Remove parameter from query.
* Add POST method and remove `-name` from the snapshot variable.
* Expand description of `<snapshot>`.
* Add data streams to introduction and expand the overall description.
* Add support for data streams.
* Add support for data streams.
* Add data stream and reference for "point-in-time view".
* Add data streams.
* Change `my_backup` to `my_repository`.
* Add description of boolean options for `wait_for_completion` parameter.
* Change command --> response
* Clarify `indices` parameter description
* Update `ignore-unavailable` parameter description
* Reword example description
* Remove "index" from API name
* Incorporating review comments from James R.
* Adding a much better request + response
* Clarify `include_global_state` description
* Incorporating additional edits.
* Changing my_backup to my_repository in example.
* Update snippet test to avoid failures
* Update TESTRESPONSE snippets
* Remove errant space
* Removing the parameter per reviewer comments
Adds parsing of `status` and `memory_reestimate_bytes`
to data frame analytics `memory_usage`. When the training surpasses
the model memory limit, the status will be set to `hard_limit` and
`memory_reestimate_bytes` can be used to update the job's
limit in order to restart the job.
Backport of #58588
Introduce pipe support, in particular head and tail
(which can also be chained).
(cherry picked from commit 4521ca3367147d4d6531cf0ab975d8d705f400ea)
(cherry picked from commit d6731d659d012c96b19879d13cfc9e1eaf4745a4)
Replaces `composable index template` and `composable template` with
`index template` throughout data stream-related docs.
`Composable index template` is only used to contrast with legacy index
templates.
Removes references to partial results from the async EQL search docs.
If an EQL search does not complete during the `wait_for_completion_timeout`
timeout period, it returns no results.
Changes the titles for tokenizer pages to sentence case.
Also moves the 'Path hierarchy tokenizer examples' page within the
'Path hierarchy tokenizer' page and adds a related redirect.
We're tracking this aggregation's experimental-progress in #58573. We'd
like a little time to be able to make backwards incompatible changes to
the aggregation because we're not 100% sure about the request and
response format yet.
Today we have individual settings for configuring node roles such as
node.data and node.master. Additionally, roles are pluggable and we have
used this to introduce roles such as node.ml and node.voting_only. As
the number of roles is growing, managing these becomes harder for the
user. For example, to create a master-only node, today a user has to
configure:
- node.data: false
- node.ingest: false
- node.remote_cluster_client: false
- node.ml: false
at a minimum if they are relying on defaults, but also add:
- node.master: true
- node.transform: false
- node.voting_only: false
If they want to be explicit. This is also challenging in cases where a
user wants to have configure a coordinating-only node which requires
disabling all roles, a list which we are adding to, requiring the user
to keep checking whether a node has acquired any of these roles.
This commit addresses this by adding a list setting node.roles for which
a user has explicit control over the list of roles that a node has. If
the setting is configured, the node has exactly the roles in the list,
and not any additional roles. This means to configure a master-only
node, the setting is merely 'node.roles: [master]', and to configure a
coordinating-only node, the setting is merely: 'node.roles: []'.
With this change we deprecate the existing 'node.*' settings such as
'node.data'.
Implements a new histogram aggregation called `variable_width_histogram` which
dynamically determines bucket intervals based on document groupings. These
groups are determined by running a one-pass clustering algorithm on each shard
and then reducing each shard's clusters using an agglomerative
clustering algorithm.
This PR addresses #9572.
The shard-level clustering is done in one pass to minimize memory overhead. The
algorithm was lightly inspired by
[this paper](https://ieeexplore.ieee.org/abstract/document/1198387). It fetches
a small number of documents to sample the data and determine initial clusters.
Subsequent documents are then placed into one of these clusters, or a new one
if they are an outlier. This algorithm is described in more details in the
aggregation's docs.
At reduce time, a
[hierarchical agglomerative clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering)
algorithm inspired by [this paper](https://arxiv.org/abs/1802.00304)
continually merges the closest buckets from all shards (based on their
centroids) until the target number of buckets is reached.
The final values produced by this aggregation are approximate. Each bucket's
min value is used as its key in the histogram. Furthermore, buckets are merged
based on their centroids and not their bounds. So it is possible that adjacent
buckets will overlap after reduction. Because each bucket's key is its min,
this overlap is not shown in the final histogram. However, when such overlap
occurs, we set the key of the bucket with the larger centroid to the midpoint
between its minimum and the smaller bucket’s maximum:
`min[large] = (min[large] + max[small]) / 2`. This heuristic is expected to
increases the accuracy of the clustering.
Nodes are unable to share centroids during the shard-level clustering phase. In
the future, resolving https://github.com/elastic/elasticsearch/issues/50863
would let us solve this issue.
It doesn’t make sense for this aggregation to support the `min_doc_count`
parameter, since clusters are determined dynamically. The `order` parameter is
not supported here to keep this large PR from becoming too complex.
Co-authored-by: James Dorfman <jamesdorfman@users.noreply.github.com>
With #58096, data streams now track the timestamp field mapping outside
of the template associated with the stream. This means you can no longer
update the timestamp field mapping using template changes.
This updates the associated data stream docs.
Introduces a new method on `MappedFieldType` to return a family type name which defaults to the field type.
Changes `wildcard` and `constant_keyword` field types to return `keyword` for field capabilities.
Relates to #53175
When a local model is constructed, the cache hit miss count is incremented.
When a user calls _stats, we will include the sum cache hit miss count across ALL nodes. This statistic is important to in comparing against the inference_count. If the cache hit miss count is near the inference_count it indicates that the cache is overburdened, or inappropriately configured.
Today when creating a follower index via the put follow API, or via an
auto-follow pattern, it is not possible to specify settings overrides
for the follower index. Instead, we copy all of the leader index
settings to the follower. Yet, there are cases where a user would want
some different settings on the follower index such as the number of
replicas, or allocation settings. This commit addresses this by allowing
the user to specify settings overrides when creating follower index via
manual put follower calls, or via auto-follow patterns. Note that not
all settings can be overrode (e.g., index.number_of_shards) so we also
have detection that prevents attempting to override settings that must
be equal between the leader and follow index. Note that we do not even
allow specifying such settings in the overrides, even if they are
specified to be equal between the leader and the follower
index. Instead, the must be implicitly copied from the leader index, not
explicitly set by the user.
TIME_PARSE works correctly if both date and time parts are specified,
and a TIME object (that contains only time is returned).
Adjust docs and add a unit test that validates the behavior.
Follows: #55223
(cherry picked from commit 9d6b679a5da88f3c131b9bdba49aa92c6c272abe)
Changes:
* Adds additional examples to the `Search a data stream` section of
`Use a data stream`
* Updates existing search docs to make them aware of data streams
This change allows to use an `index_filter` in the
field capabilities API. Indices are filtered from
the response if the provided query rewrites to `match_none`
on every shard:
````
GET metrics-*
{
"index_filter": {
"bool": {
"must": [
"range": {
"@timestamp": {
"gt": "2019"
}
}
}
}
}
````
The filtering is done on a best-effort basis, it uses the can match phase
to rewrite queries to `match_none` instead of fully executing the request.
The first shard that can match the filter is used to create the field
capabilities response for the entire index.
Closes#56195
This allows doing true CAS operations on aliases, making sure that an alias is actually properly
moved from a given source index onto a given target index. This is useful to ensure that an
alias is actually moved from a given index to another one, and not just added to another index.
* Adding documentation for near real-time search.
* Adding link to NRT topic and clarifying some text.
* Adding diagrams and incorporating changes from David T.
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>
(cherry picked from commit 25cbbe56dd29fbee2efe8040e9c8b92d168cb670)
Signed-off-by: Andrei Dan <andrei.dan@elastic.co>
Adds corresponding steps for the get and delete data stream APIs to the
data stream setup tutorial. Also provides some guidance on how to
determine the current write index for a data stream.
This commit clarifies that the `expand_wildcards` option (as well as other
`IndicesOptions` parameters) can be used with the Create Snapshot API, but that
they must be in the body of the request.
Also clarifies the connection between `expand_wildcards` and hidden indices as
it relates to snapshots.
Changes:
* Adds new 'Reindex with a data stream' section to 'Use a data stream'
* Makes the existing reindex API docs aware of data streams
* Rewrites the reindex glossary definition to include data streams
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
Co-authored-by: Tim Vernum <tim@adjective.org>
Co-authored-by: lcawl <lcawley@elastic.co>
Backport of #57870 to 7.x branch.
This change now also copies the op_type from the reindex request's destination index request to the actual index request being used in the bulk request.
For ensuring no document exists, the op_type create doesn't need to be copied, since Versions.MATCH_DELETED will copied from the 'mainRequest.getDestination().version()'.
The `version()` method on IndexRequest only returns Versions.MATCH_DELETED if op_type=create and no specific version has been specified.
However in order to be able to index into a data stream, the op_type must be create. So in order to support that the op_type must be copied from the reindex request's destination index request to the actual index request being used in the bulk request.
Relates to #53100 and #57788
Changes:
* Updates 'Data streams' intro page to focus on problem solution and
benefits.
* Adds 'Data streams overview' page to cover conceptual information,
based on existing content in the 'Data streams' intro.
* Adds diagrams for data streams and search/indexing request examples.
* Moves API jump list and API docs to a new 'Data streams APIs' section.
Links to these APIs will be available through tutorials.
* Add xrefs to existing docs for concepts like generation, write index,
and append-only.
Changes:
* Condenses and relocates the `docvalue_fields` example to the 'Run a search'
page.
* Adds docs for the `docvalue_fields` request body parameter.
* Updates several related xrefs.
Co-authored-by: debadair <debadair@elastic.co>
We document that the cluster state API is an internal representation which may
change, but apparently not emphatically enough. This commit adds a `NOTE:`
admonition to this paragraph.
Per 49554 I added standard deviation sampling and variance sampling to the extended stats interface.
Closes#49554
Co-authored-by: Igor Motov <igor@motovs.org>
Co-authored-by: andrewjohnson2 <aj114114@gmail.com>
Creates a new page for a 'Set up a data stream' tutorial, based on
existing content in 'Data streams'.
Also adds tutorials for:
* Configuring an ILM policy for a data stream
* Indexing documents to a data stream
* Searching a data stream
* Manually rolling over a data stream
When Joni, the regex engine that powers grok emits a warning it
does so by default to System.err. System.err logs are all bucketed
together in the server log at WARN level. When Joni emits a warning,
it can be extremely verbose, logging a message for each execution
again that pattern. For ingest node that means for every document
that is run that through Grok. Fortunately, Joni provides a call
back hook to push these warnings to a custom location.
This commit implements Joni's callback hook to push the Joni warning
to the Elasticsearch server logger (logger.org.elasticsearch.ingest.common.GrokProcessor)
at debug level. Generally these warning indicate a possible issue with
the regular expression and upon creation of the Grok processor will
do a "test run" of the expression and log the result (if any) at WARN
level. This WARN level log should only occur on pipeline creation which
is a much lower frequency then every document.
Additionally, the documentation is updated with instructions for how
to set the logger to debug level.
Cleans up the reference documentation for the following
search API parameters:
* `_source` query parameter
* `_source_excludes` query parameter
* `_source_includes` query parameter
* `_source` request body parameter
* `hits._source` response property
Changes:
* Adds title abbreviation
* Adds Lucene link to description
* Adds standard headings
* Simplifies analyze example
* Simplifies analyzer example and adds contextual text
This adds new plugin level circuit breaker for the ML plugin.
`model_inference` is the circuit breaker qualified name.
Right now it simply adds to the breaker when the model is loaded (and possibly breaking) and removing from the breaker when the model is unloaded.
Deleting expired data can take a long time leading to timeouts if there
are many jobs. Often the problem is due to a few large jobs which
prevent the regular maintenance of the remaining jobs. This change adds
a job_id parameter to the delete expired data endpoint to help clean up
those problematic jobs.
This PR adds the initial Java side changes to enable
use of the per-partition categorization functionality
added in elastic/ml-cpp#1293.
There will be a followup change to complete the work,
as there cannot be any end-to-end integration tests
until elastic/ml-cpp#1293 is merged, and also
elastic/ml-cpp#1293 does not implement some of the
more peripheral functionality, like stop_on_warn and
per-partition stats documents.
The changes so far cover REST APIs, results object
formats, HLRC and docs.
Backport of #57683
Adds some guidance for designing clusters to be resilient to
failures, including example architectures.
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Co-authored-by: David Turner <david.turner@elastic.co>
* Clarifying environment variable substitution in the ES configuration YAML
* Update code snippet
* Remove extraneous quotes from string example
* Incorporating review feedback
When we force delete a DF analytics job, we currently first force
stop it and then we proceed with deleting the job config.
This may result in logging errors if the job config is deleted
before it is retrieved while the job is starting.
Instead of force stopping the job, it would make more sense to
try to stop the job gracefully first. So we now try that out first.
If normal stop fails, then we resort to force stopping the job to
ensure we can go through with the delete.
In addition, this commit introduces `timeout` for the delete action
and makes use of it in the child requests.
Backport of #57680
Moves the source filtering example snippets form the "Request body
search" API docs page to the "Return fields in a search" section of the
"Run a search" page.
In #55592 and #55416, we deprecated the settings for enabling and disabling
basic license features and turned those settings into no-ops. Since doing so,
we've had feedback that this change may not give users enough time to cleanly
switch from non-ILM index management tools to ILM. If two index managers
operate simultaneously, results could be strange and difficult to
reconstruct. We don't know of any cases where SLM will cause a problem, but we
are restoring that setting as well, to be on the safe side.
This PR is not a strict commit reversion. First, we are keeping the new
xpack.watcher.use_ilm_index_management setting, introduced when
xpack.ilm.enabled was made a no-op, so that users can begin migrating to using
it. Second, the SLM setting was modified in the same commit as a group of other
settings, so I have taken just the changes relating to SLM.
The create index action name (`indices:admin/create`) can no longer be used to grant privileges to auto create indices and instead the `create_index` builtin privilege should be used.
Relates to #55858
Co-authored-by: Jake Landis <jake.landis@elastic.co>
This PR adds a section to the new 'run a search' reference that explains
the options for returning fields. Previously each option was only listed as a
separate request parameter and it was hard to know what was available.
Add `TRIM` function which combines the functionality of both
`LTRIM` and `RTRIM` by stripping both leading and trailing
whitespaces.
Refers to #41195
(cherry picked from commit 6c86c919e12f0c4cb5e39d129aa65ab3e274268f)
This commit highlights the ability for geo_point fields to be
used in geo_shape queries. It also adds an explicit geo_point
example in the geo_shape query documentation
Closes#56927.
Several APIs support options that can be specified as a query parameter or a
request body parameter.
Currently, this is documented using notes, which can get rather lengthy. This
replaces those multiple notes with a single note and a footnote.
Add basic support for `TOP X` as a synonym to LIMIT X which is used
by [MS-SQL server](https://docs.microsoft.com/en-us/sql/t-sql/queries/top-transact-sql?view=sql-server-ver15),
e.g.:
```
SELECT TOP 5 a, b, c FROM test
```
TOP in SQL server also supports the `PERCENTAGE` and `WITH TIES`
keywords which this implementation doesn't.
Don't allow usage of both TOP and LIMIT in the same query.
Refers to #41195
(cherry picked from commit 2f5ab81b9ad884434d1faa60f4391f966ede73e8)
Generally we don't advocate for using `stored_fields`, and we're interested in
eventually removing the need for this parameter. So it's best to avoid using
stored fields in our docs examples when it's not actually necessary.
Individual changes:
* Avoid using 'stored_fields' in our docs.
* When defining script fields in top-hits, de-emphasize stored fields.
Reworks the `from / size` content to `Paginate search results`.
Moves those docs from the request body search API page (slated for
deletion) to the `Run a search` tutorial docs.
Also adds some notes to the `from` and `size` param docs.
Co-authored-by: debadair <debadair@elastic.co>
The old description mentions a setting that we ended up not merging.
The periodic real-memory checks are automatic and do not require
the user to configure any setting.
**Goal**
Create a top-level search section. This will let us clean up our search
API reference docs, particularly content from [`Request body search`][0].
**Changes**
* Creates a top-level `Search your data` page. This page is designed to
house concept and tutorial docs related to search.
* Creates a `Run a search` page under `Search your data`. For now, This
contains a basic search tutorial. The goal is to add content from
[`Request body search`][0] to this in the future.
* Relocates `Long-running searches` and `Search across clusters` under
`Search your data`. Increments several headings in that content.
* Reorders the top-level TOC to move `Search your data` higher. Also
moves the `Query DSL`, `EQL`, and `SQL access` chapters immediately
after.
Relates to #48194
[0]: https://www.elastic.co/guide/en/elasticsearch/reference/master/search-request-body.html
* Moves `Discovery and cluster formation` content from `Modules` to
`Set up Elasticsearch`.
* Combines `Adding and removing nodes` with `Adding nodes to your
cluster`. Adds related redirect.
* Removes and redirects the `Modules` page.
* Rewrites parts of `Discovery and cluster formation` to remove `module`
references and meta references to the section.
Relocates the "Remote Clusters" documentation from the "Modules" section to the "Set up Elasticsearch" section.
Supporting changes:
* Reorders the "Bootstrap checks for X-Pack" section to immediately follow the "Bootstrap checks"chapter.
* Removes an outdated X-Pack `idef` from the "Remote Clusters" intro.
* Make it more clear that you can use `month` or `1M`.
* Explain rounding rules
* Consistently use "time zone" instead of "timezone". It looks like both
are right but I see "time zone" much more. And the parameter in
elasticsearch is `time_zone` so we may as well line up.
Closes#56760
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
This commit adds support for rules with multiple tokens on LHS, also
known as "contraction rules", into stemmer override token
filter. Contraction rules are handy into translating multiple
inflected words into the same root form. One side effect of this change is
that it brings stemmer override rules format closer to synonym rules
format so that it makes it easier to translate one into another.
This change also makes stemmer override rules parser more strict so
that it should catch more errors which were previously accepted.
Closes#56113
* [ML] adds new for_export flag to GET _ml/inference API (#57351)
Adds a new boolean flag, `for_export` to the `GET _ml/inference/<model_id>` API.
This flag is useful for moving models between clusters.
This commit adds the `expand_wildcards` parameter documentation to the
`_cat/indices` and `_cat/aliases` docs, as those APIs now support
`expand_wildcards`. Additionally, clarifies the `expand_wildcards` docs with
respect to hidden indices.
This adds a max_model_memory setting to forecast requests.
This setting can take a string value that is formatted according to byte sizes (i.e. "50mb", "150mb").
The default value is `20mb`.
There is a HARD limit at `500mb` which will throw an error if used.
If the limit is larger than 40% the anomaly job's configured model limit, the forecast limit is reduced to be strictly lower than that value. This reduction is logged and audited.
related native change: https://github.com/elastic/ml-cpp/pull/1238
closes: https://github.com/elastic/elasticsearch/issues/56420
Implement TIME_PARSE(<time_str>, <pattern_str>) function
which allows to parse a time string according to the specified
pattern into a time object. The patterns allowed are those of
java.time.format.DateTimeFormatter.
Closes#54963
Co-authored-by: Andrei Stefan <astefan@users.noreply.github.com>
Co-authored-by: Patrick Jiang(白泽) <patrickjiang0530@gmail.com>
(cherry picked from commit 1fe1188d449cad7d0782a202372edc52a4014135)
Backporting #56888 to 7.x branch.
Limit the creation of data streams only for namespaces that have a composable template with a data stream definition.
This way we ensure that mappings/settings have been specified and will be used at data stream creation and data stream rollover.
Also remove `timestamp_field` parameter from create data stream request and
let the create data stream api resolve the timestamp field
from the data stream definition snippet inside a composable template.
Relates to #53100
Changes:
* Rewrites description and adds a Lucene link
* Reformats the configurable parameters as a definition list
* Changes the `Theory` heading to `Using the min_hash token filter for
similarity search`
* Adds some additional detail to the analyzer example
This saves memory when running numeric significant terms which are not
at the top level by merging its collection into numeric terms and relying
on the optimization that we made in #55873.
As discussed at https://elastic.slack.com/archives/C0D1XEXEZ/p1586939752242300 is it not possible to restore snapshots taken on newer versions into clusters running lower versions. For example, a snapshot created in a 7.6.0 cluster cannot be restored on a 7.5.0 cluster. This needs to be documented.
Makes following changes to better clarify docs for read-only URL
snapshot repositories:
* Adds an example snippet for registering a URL repository
* Rewrites the protocols paragraph
* Adds a note to explicitly point out that only URLs using the `ftp`,
`http`, `http`, and `jar` protocols do not need the `path.repo`
setting.
Fixes#16280
Previously, the restore API snippet included a `include_global_state` value of `true`.
Some users copy and paste the code example verbatim, updating only the index and
snapshot value names. Running the snippet could inadvertently wipe out a
cluster's current ILM policies, index templates, and ingest pipelines.
This change updates the snippet to use a `include_global_state` value of
`false`. It also adds a callout that better describes impacts of
using a `include_global_state` argument of `true`.
Co-authored-by: Mike Wong <mike.wong@elastic.co>
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Co-authored-by: David Turner <david.turner@elastic.co>
Move the JDBC functionality integration tests from `:sql:qa` to a separate
module `:sql:qa:jdbc`. This way the tests are isolated from the rest of the
integration tests and they only depend to the `:sql:jdbc` module, thus
removing the danger of accidentally pulling in some dependency that may
hide bugs.
Moreover this is a preparation for #56722, so that we can run those tests
between different JDBC and ES node versions and ensure forward
compatibility.
Move the rest of existing tests inside a new `:sql:qa:server` project, so that
the `:sql:qa` becomes the parent project for both and one can run all the integration
tests by using this parent project.
(cherry picked from commit c09f4a04484b8a43934fe58fbc41bd90b7dbcc76)
Changes:
* Adds API reference docs for the delete snapshot repo API.
* Corrects an error in the delete snapshot repo API spec. Comma-separated
repository names are not supported.
* Relocates the existing delete snapshot repo API example docs.
Elasticsearch enables HTTP compression by default. However, to mitigate
potential security risks like the BREACH attack, compression is disabled by
default if HTTPS is enabled.
This updates the `http.compression` setting definition accordingly and adds
additional context.
Co-authored-by: Leaf-Lin <39002973+Leaf-Lin@users.noreply.github.com>
* Changes for #52239.
* Incorporating review feedback from Julie T. Also single-sourcing nexted options in the Mapping page and referencing them in the Nested page.
* Moving tip after the introduction and clarifying limits.
* Update docs/reference/mapping.asciidoc
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
* Update docs/reference/mapping/types/nested.asciidoc
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Throttling nightly cleanup as much as we do has been over cautious.
Night cleanup should be more lenient in its throttling. We still
keep the same batch size, but now the requests per second scale
with the number of data nodes. If we have more than 5 data nodes,
we don't throttle at all.
Additionally, the API now has `requests_per_second` and `timeout` set.
So users calling the API directly can set the throttling.
This commit also adds a new setting `xpack.ml.nightly_maintenance_requests_per_second`.
This will allow users to adjust throttling of the nightly maintenance.
* [Transform] add support for terms agg in transforms (#56696)
This adds support for `terms` and `rare_terms` aggs in transforms.
The default behavior is that the results are collapsed in the following manner:
`<AGG_NAME>.<BUCKET_NAME>.<SUBAGGS...>...`
Or if no sub aggs exist
`<AGG_NAME>.<BUCKET_NAME>.<_doc_count>`
The mapping is also defined as `flattened` by default. This is to avoid field explosion while still providing (limited) search and aggregation capabilities.
This aggregation will perform normalizations of metrics
for a given series of data in the form of bucket values.
The aggregations supports the following normalizations
- rescale 0-1
- rescale 0-100
- percentage of sum
- mean normalization
- z-score normalization
- softmax normalization
To specify which normalization is to be used, it can be specified
in the normalize agg's `normalizer` field.
For example:
```
{
"normalize": {
"buckets_path": <>,
"normalizer": "percent"
}
}
```
* [DOCS] Add info about ILM and unallocated shards.
* Incorporated review feedback.
* Update docs/reference/ilm/actions/ilm-allocate.asciidoc
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
* Apply suggestions from code review
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
* Fix xref
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
Co-authored-by: James Rodewig <james.rodewig@elastic.co>
This adds a few things to the `breakdown` of the profiler:
* `histogram` aggregations now contain `total_buckets` which is the
count of buckets that they collected. This could be useful when
debugging a histogram inside of another bucketing agg that is fairly
selective.
* All bucketing aggs that can delay their sub-aggregations will now add
a list of delayed sub-aggregations. This is useful because we
sometimes have fairly involved logic around which sub-aggregations get
delayed and this will save you from having to guess.
* Aggregtations wrapped in the `MultiBucketAggregatorWrapper` can't
accurately add anything to the breakdown. Instead they the wrapper
adds a marker entry `"multi_bucket_aggregator_wrapper": true` so we
can be quickly pick out such aggregations when debugging.
It also fixes a bug where `_count` breakdown entries were contributing
to the overall `time_in_nanos`. They didn't add a large amount of time
so it is unlikely that this caused a big problem, but I was there.
To support the arbitrary breakdown data this reworks the profiler so
that the `breakdown` can contain any data that is supported by
`StreamOutput#writeGenericValue(Object)` and
`XContentBuilder#value(Object)`.
This optional parameter can only be a string. To test out a transient custom
analysis chain, users are expected to use the 'tokenizer', 'filter', and
'char_filter' parameters.