OpenSearch

Commit Graph

Author	SHA1	Message	Date
Alan Woodward	88b45dfa61	Convert TextFieldMapper to parametrized form (#63269 ) (#63392 ) As a result of this, we can remove a chunk of code from TypeParsers as well. Tests for search/index mode analyzers have moved into their own file. This commit also rationalises the serialization checks for parameters into a single SerializerCheck interface that takes the values includeDefaults, isConfigured and the value itself. Relates to #62988	2020-10-07 13:26:25 +01:00
Igor Motov	6a9cde2918	Add support for x_opaque_id to _cat/tasks (#63036 ) (#63135 ) Adds an optional column with support for x_opaque_id to _cat/tasks API. Closes #61118	2020-10-01 13:17:46 -04:00
Alan Woodward	63afc61b08	Introduce FetchContext (#62357 ) We currently pass a SearchContext around to share configuration among FetchSubPhases. With the introduction of runtime fields, it would be useful to start storing some state on this context to be shared between different subphases (for example, stored fields or search lookups can be loaded lazily but referred to by many different subphases). However, SearchContext is a very large and unwieldy class, and adding more methods or state here feels like a bridge too far. This commit introduces a new FetchContext class that exposes only those methods on SearchContext that are required for fetch phases. This reduces the API surface area for fetch phases considerably, and should give us some leeway to add further state.	2020-09-17 09:57:43 +01:00
Nik Everett	24a24d050a	Implement fields fetch for runtime fields (backport of #61995 ) (#62416 ) This implements the `fields` API in `_search` for runtime fields using doc values. Most of that implementation is stolen from the `docvalue_fields` fetch sub-phase, just moved into the same API that the `fields` API uses. At this point the `docvalue_fields` fetch phase looks like a special case of the `fields` API. While I was at it I moved the "which doc values sub-implementation should I use for fetching?" question from a bunch of `instanceof`s to a method on `LeafFieldData` so we can be much more flexible with what is returned and we're not forced to extend certain classes just to make the fetch phase happy. Relates to #59332	2020-09-15 20:24:10 -04:00
Nik Everett	771a8893a6	Add more debugging information for cardinality agg (#62317 ) (#62397 ) This adds two extra bits of info to the profiler: 1. Count of the number of different types of collectors. This lets us figure out if we're using the optimization for segment ordinals. It adds a few more similar counters just for good measure. 2. Profiles the `getLeafCollector` and `postCollection` methods. These are non-trivial for some aggregations, like cardinality.	2020-09-15 13:21:11 -04:00
Julie Tibshirani	4a19bdb2ea	Support the 'fields' option in inner_hits and top_hits. (#62337 ) This PR adds support for the 'fields' option in the following places: * Anytime `inner_hits` is used, for both fetching nested/ child docs and field collapsing * The `top_hits` aggregation Addresses #61949.	2020-09-14 11:51:45 -07:00
Nhat Nguyen	3d69b5c41e	Introduce point in time APIs in x-pack basic (#61062 ) This commit introduces a new API that manages point-in-times in x-pack basic. Elasticsearch pit (point in time) is a lightweight view into the state of the data as it existed when initiated. A search request by default executes against the most recent point in time. In some cases, it is preferred to perform multiple search requests using the same point in time. For example, if refreshes happen between search_after requests, then the results of those requests might not be consistent as changes happening between searches are only visible to the more recent point in time. A point in time must be opened before being used in search requests. The `keep_alive` parameter tells Elasticsearch how long it should keep a point in time around. ``` POST /my_index/_pit?keep_alive=1m ``` The response from the above request includes a `id`, which should be passed to the `id` of the `pit` parameter of search requests. ``` POST /_search { "query": { "match" : { "title" : "elasticsearch" } }, "pit": { "id": "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWICBXV1aWQyAAAFdXVpZDEAAQltYXRjaF9hbGw_gAAAAA==", "keep_alive": "1m" } } ``` Point-in-times are automatically closed when the `keep_alive` is elapsed. However, keeping point-in-times has a cost; hence, point-in-times should be closed as soon as they are no longer used in search requests. ``` DELETE /_pit { "id" : "46ToAwMDaWR4BXV1aWQxAgZub2RlXzEAAAAAAAAAAAEBYQNpZHkFdXVpZDIrBm5vZGVfMwAAAAAAAAAAKgFjA2lkeQV1dWlkMioGbm9kZV8yAAAAAAAAAAAMAWIBBXV1aWQyAAA=" } ``` #### Notable works in this change: - Move the search state to the coordinating node: #52741 - Allow searches with a specific reader context: #53989 - Add the ability to acquire readers in IndexShard: #54966 Relates #46523 Relates #26472 Co-authored-by: Jim Ferenczi <jimczi@apache.org>	2020-09-10 19:25:47 -04:00
Jake Landis	e80f68ed77	[7.x] Add test for item-level error when no write index defined for an alias in bulk API (#55503 ) (#61999 ) Add test for item-level error when no write index defined for an alia… Co-authored-by: Jake Landis <jake.landis@elastic.co> Co-authored-by: bellengao <gbl_long@163.com>	2020-09-09 14:27:25 -05:00
Nik Everett	1104d65465	Fix bug with terms' min_doc_count (#62130 ) (#62177 ) The `global_ordinals` implementation of `terms` had a bug when `min_doc_count: 0` that'd cause sub-aggregations to have array index out of bounds exceptions. Ooops. My fault. This fixes the bug by assigning ordinals to those buckets. Closes #62084	2020-09-09 13:04:51 -04:00
Luca Cavanna	ab8f65a099	[TEST] Don't specify a type unless needed (#62011 ) We have a couple of yaml tests that index documents under a 'test' type, while they could omit it. We do want to still test that specifying the type is still allowed in 7.x but we already have specific tests for that, and other tests should use the endpoint that don't require specifying a type.	2020-09-05 09:27:00 +02:00
Dan Hermann	d52ee17054	Adjust BWC after backport of #60818	2020-09-01 08:30:32 -05:00
Dan Hermann	88a448f1cd	Fix wrong result when executing bulk requests with and without pipeline (#60818 ) (#61777 )	2020-09-01 07:05:25 -05:00
Julie Tibshirani	85ad328df7	Ensure fetch fields aren't dropped when rewriting search. (#61390 ) Previously we didn't retain the requested fields when performing a shallow copy of the search source. This meant that when a search was rewritten, we could drop the requested fields and fail to return them in the response.	2020-08-20 14:58:58 -07:00
Nik Everett	1e6400285c	Some progress on failing runtime fields tests (bring #61098 to 7.x) (#61115 ) * Some progress on failing runtime fields tests (bring #61098 to 7.x) This breaks apart the a test for the `terms` aggregation into one that work for runtime fields and one that doesn't.	2020-08-17 09:56:55 -04:00
Nik Everett	639782da12	Break up a test for with runtime fields (brings #60931 to master) (#61103 ) (#61114 ) Breaks up an integration test into one that runtime fields can run and one that runtime fields have to skip. This is because runtime fields don't have global ords and we assert things about global ords in the test we have to skip.	2020-08-17 09:56:33 -04:00
Ryan Ernst	bce93b93b2	Increase docs and client rest test timeouts for Darwin (#61075 ) The Darwin CI hosts continue to struggle with timeouts. This commit increases the timouts for docs and client rest tests. relates #58286	2020-08-13 21:22:06 -07:00
Nhat Nguyen	843122ccce	Prevent shard relocation while closing index (#61072 ) We might fail to close an index if some shards are being relocated. Close #60913	2020-08-13 10:17:48 -04:00
Russ Cam	d593c963f3	Use deprecated object to deprecate synced flush API (#57096 ) Relates: #50835 This commit updates the synced flush REST API spec to deprecate the whole API.	2020-08-06 15:17:26 +10:00
István Zoltán Szabó	7c995aa3ff	[DOCS] Fixes broken links in rest API spec. (#60582 )	2020-08-03 15:31:25 +02:00
Nik Everett	2cde43b799	Allows nanosecond resolution in search_after (backport of #60328 ) (#60426 ) Allows nanosecond resolution in search_after (#60328) This fixes `search_after` to properly parse string formatted dates that have nanosecond resolution. Closes #52424	2020-08-03 08:17:48 -04:00
James Rodewig	771e9f142a	[DOCS] Move search pagination content to one page (#60515 ) (#60525 )	2020-07-31 12:40:40 -04:00
Julie Tibshirani	fe1d877b69	Avoid using string 'y' in fields REST tests. (#60471 ) Some yaml parsers interpret 'y' and 'yes' as the boolean 'true'.	2020-07-31 09:12:15 -07:00
Dan Hermann	bd6b20a4ec	Fix failing test for resolve index API (#60306 ) (#60508 )	2020-07-31 08:43:24 -05:00
Tim Brooks	85fdf959ad	Add configured indexing memory limit to node stats (#60414 ) This commit adds the configured memory limit to the node stats API.	2020-07-29 12:28:21 -06:00
Igor Motov	0dd53b76bd	Add aggregation list to node info (#60074 ) (#60256 ) Adds a full list of supported aggregations to the node info API. This list will be used in transform tests and telemetry mapping tests that will be added as follow-up PRs. Fixes #59774	2020-07-28 14:06:12 -04:00
Julie Tibshirani	c7bfb5de41	Add search `fields` parameter to support high-level field retrieval. (#60258 ) This feature adds a new `fields` parameter to the search request, which consults both the document `_source` and the mappings to fetch fields in a consistent way. The PR merges the `field-retrieval` feature branch. Addresses #49028 and #55363.	2020-07-28 10:58:20 -07:00
David Turner	9450ea08b4	Log and track open/close of transport connections (#60297 ) Transport connections between nodes remain in place until one or other node shuts down or the connection is disrupted by a flaky network. Today it is very difficult to demonstrate that transient failures and cluster instability are caused by the network even though this is often the case. In particular, transport connections open and close without logging anything, even at `DEBUG` level, making it very hard to quantify the scale of the problem or to correlate the networking problems with external events. This commit adds the missing `DEBUG`-level logging when transport connections open and close, and also tracks the total number of transport connections a node has opened as a measure of the stability of the underlying network.	2020-07-28 17:08:04 +01:00
Dan Hermann	fe12217c7f	[7.x] Move REST specs for data streams (#60111 )	2020-07-23 08:10:54 -05:00
Tim Brooks	ba01540d7e	Implement human readable indexing pressure stats (#60058 ) The indexing pressure stats do not currently have human readable variants. This commit add human readable variants and updates the documentation.	2020-07-22 12:07:59 -06:00
Nik Everett	49f365ddfd	Fix bug in deep pipeline agg serialization (#59984 ) In #54716 I removed pipeline aggregators from the aggregation result tree and caused us to read them from the request. This saves a bunch of round trip bytes, which is neat. But there was a bug in the backwards compatibility logic. You see, we still have to give the pipeline aggregations to nodes older than 7.8 over the wire because that is how they know what pipelines to run. They have the pipelines in the request but they don't read them. They use the ones in the response tree. Anyway, we had a bug where we were never sending pipelines defined two levels down. So while you are upgrading the pipeline wouldn't run. Sometimes. If the data node of the "first" result was post-7.8 and the coordinating node was pre-7.8. This fixes the bug.	2020-07-21 16:03:15 -04:00
James Baiera	b3363cf8f9	[7.x] Remove unneeded rest params from Data Stream Stats (#59575 ) (#59661 ) This PR removes the expand_wildcards and forbid_closed_indices parameters from the Data Streams Stats REST endpoint. These options are required for broadcast requests, but are not needed for anything in terms of resolving data streams. Instead, we just set a default set of IndicesOptions on the transport request.	2020-07-21 15:59:16 -04:00
Lee Hinman	8c7d414a3b	[7.x] Fix retrieving data stream stats for a DS with multiple backing indices (#59806 ) (#59810 ) Backports the following commits to 7.x: Fix retrieving data stream stats for a DS with multiple backing indices (#59806)	2020-07-17 16:56:07 -06:00
Lee Hinman	f6b08a3115	[7.x] Allow simulating existing composable index template (#59733 ) (#59798 ) Backports the following commits to 7.x: Allow simulating existing composable index template (#59733)	2020-07-17 13:10:07 -06:00
Benjamin Trent	b7f30fc929	[7.x] Adding new `require_alias` option to indexing requests (#58917 ) (#59769 ) * Adding new `require_alias` option to indexing requests (#58917) This commit adds the `require_alias` flag to requests that create new documents. This flag, when `true` prevents the request from automatically creating an index. Instead, the destination of the request MUST be an alias. When the flag is not set, or `false`, the behavior defaults to the `action.auto_create_index` settings. This is useful when an alias is required instead of a concrete index. closes https://github.com/elastic/elasticsearch/issues/55267	2020-07-17 10:24:58 -04:00
Igor Motov	2408803fad	Adds hard_bounds to histogram aggregations (#59175 ) (#59656 ) Adds a hard_bounds parameter to explicitly limit the buckets that a histogram can generate. This is especially useful in case of open ended ranges that can produce a very large number of buckets.	2020-07-16 15:31:53 -04:00
James Baiera	5f7e7e9410	[7.x] Data Stream Stats API (#58707 ) (#59566 ) This API reports on statistics important for data streams, including the number of data streams, the number of backing indices for those streams, the disk usage for each data stream, and the maximum timestamp for each data stream	2020-07-14 16:57:46 -04:00
Tim Brooks	408a07f96a	Separate coordinating and primary bytes in stats (#59487 ) Currently we combine coordinating and primary bytes into a single bucket for indexing pressure stats. This makes sense for rejection logic. However, for metrics it would be useful to separate them.	2020-07-14 12:37:06 -06:00
Dan Hermann	e54b4a729f	[7.x] Adds write_index_only option to put mapping API (#59539 )	2020-07-14 10:34:08 -05:00
Andrei Dan	7dcdaeae49	Default to @timestamp in composable template datastream definition (#59317 ) (#59516 ) This makes the data_stream timestamp field specification optional when defining a composable template. When there isn't one specified it will default to `@timestamp`. (cherry picked from commit 5609353c5d164e15a636c22019c9c17fa98aac30) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>	2020-07-14 12:36:54 +01:00
Andrei Dan	4180333bbc	[7.x] Composable templates: add a default mapping for @timestamp (#59244 ) (#59510 ) This adds a low precendece mapping for the `@timestamp` field with type `date`. This will aid with the bootstrapping of data streams as a timestamp mapping can be omitted when nanos precision is not needed. (cherry picked from commit 4e72f43d62edfe52a934367ce9809b5efbcdb531) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>	2020-07-14 11:29:33 +01:00
Tim Brooks	623df95a32	Adding indexing pressure stats to node stats API (#59467 ) We have recently added internal metrics to monitor the amount of indexing occurring on a node. These metrics introduce back pressure to indexing when memory utilization is too high. This commit exposes these stats through the node stats API.	2020-07-13 17:23:42 -06:00
Martijn van Groningen	b1b7bf3912	Make data streams a basic licensed feature. (#59392 ) Backport of #59293 to 7.x branch. * Create new data-stream xpack module. * Move TimestampFieldMapper to the new module, this results in storing a composable index template with data stream definition only to work with default distribution. This way data streams can only be used with default distribution, since a data stream can currently only be created if a matching composable index template exists with a data stream definition. * Renamed `_timestamp` meta field mapper to `_data_stream_timestamp` meta field mapper. * Add logic to put composable index template api to fail if `_data_stream_timestamp` meta field mapper isn't registered. So that a more understandable error is returned when attempting to store a template with data stream definition via the oss distribution. In a follow up the data stream transport and rest actions can be moved to the xpack data-stream module.	2020-07-13 17:26:46 +02:00
Dan Hermann	e01d73c737	[7.x] Data stream admin actions are now index-level actions	2020-07-10 14:36:18 -05:00
Dan Hermann	c26d2b5fa5	Data stream support for indices shard stores API	2020-07-09 13:11:45 -05:00
Martijn van Groningen	17bd559253	Fix the timestamp field of a data stream to @timestamp (#59210 ) Backport of #59076 to 7.x branch. The commit makes the following changes: * The timestamp field of a data stream definition in a composable index template can only be set to '@timestamp'. * Removed custom data stream timestamp field validation and reuse the validation from `TimestampFieldMapper` and instead only check that the _timestamp field mapping has been defined on a backing index of a data stream. * Moved code that injects _timestamp meta field mapping from `MetadataCreateIndexService#applyCreateIndexRequestWithV2Template58956(...)` method to `MetadataIndexTemplateService#collectMappings(...)` method. * Fixed a bug (#58956) that cases timestamp field validation to be performed for each template and instead of the final mappings that is created. * only apply _timestamp meta field if index is created as part of a data stream or data stream rollover, this fixes a docs test, where a regular index creation matches (logs-*) with a template with a data stream definition. Relates to #58642 Relates to #53100 Closes #58956 Closes #58583	2020-07-08 17:30:46 +02:00
Andrei Dan	24c6a30e2b	[7.9] GET data stream API returns additional information (#59128 ) (#59177 ) * GET data stream API returns additional information (#59128) This adds the data stream's index template, the configured ILM policy (if any) and the health status of the data stream to the GET _data_stream response. Restoring a data stream from a snapshot could install a data stream that doesn't match any composable templates. This also makes the `template` field in the `GET _data_stream` response optional. (cherry picked from commit 0d9c98a82353b088c782b6a04c44844e66137054) Signed-off-by: Andrei Dan <andrei.dan@elastic.co>	2020-07-07 20:30:09 +01:00
James Rodewig	6ed356ffc3	[DOCS] Replace `datatype` with `data type` (#58972 ) (#59184 )	2020-07-07 14:59:35 -04:00
Nik Everett	b99b2f1a08	Fix test for adjacency_matrix It needs to request the value count in a backwards compatible way.	2020-07-07 11:20:43 -04:00
Nik Everett	eb169ae226	Fix lookup support in adjacency matrix (backport of #59099 ) (#59108 ) This request: ``` POST /_search { "aggs": { "a": { "adjacency_matrix": { "filters": { "1": { "terms": { "t": { "index": "lookup", "id": "1", "path": "t" } } } } } } } } ``` Would fail with a 500 error and a message like: ``` { "error": { "root_cause": [ { "type": "illegal_state_exception", "reason":"async actions are left after rewrite" } ] } } ``` This fixes that by moving the query rewrite phase from a synchronous call on the data nodes into the standard aggregation rewrite phase which can properly handle the asynchronous actions.	2020-07-07 10:28:20 -04:00
Jake Landis	604c6dd528	7.x - Create plugin for yamlTest task (#56841 ) (#59090 ) This commit creates a new Gradle plugin to provide a separate task name and source set for running YAML based REST tests. The only project converted to use the new plugin in this PR is distribution/archives/integ-test-zip. For which the testing has been moved to :rest-api-spec since it makes the most sense and it avoids a small but awkward change to the distribution plugin. The remaining cases in modules, plugins, and x-pack will be handled in followups. This plugin is distinctly different from the plugin introduced in #55896 since the YAML REST tests are intended to be black box tests over HTTP. As such they should not (by default) have access to the classpath for that which they are testing. The YAML based REST tests will be moved to separate source sets (yamlRestTest). The which source is the target for the test resources is dependent on if this new plugin is applied. If it is not applied, it will default to the test source set. Further, this introduces a breaking change for plugin developers that use the YAML testing framework. They will now need to either use the new source set and matching task, or configure the rest resources to use the old "test" source set that matches the old integTest task. (The former should be preferred). As part of this change (which is also breaking for plugin developers) the rest resources plugin has been removed from the build plugin and now requires either explicit application or application via the new YAML REST test plugin. Plugin developers should be able to fix the breaking changes to the YAML tests by adding apply plugin: 'elasticsearch.yaml-rest-test' and moving the YAML tests under a yamlRestTest folder (instead of test)	2020-07-06 14:16:26 -05:00
Dan Hermann	550dcb0ca6	[7.x] Delete data stream API accepts multiple names (#59064 )	2020-07-06 08:06:10 -05:00
Martijn van Groningen	f0dd9b4ace	Add data stream timestamp validation via metadata field mapper (#59002 ) Backport of #58582 to 7.x branch. This commit adds a new metadata field mapper that validates, that a document has exactly a single timestamp value in the data stream timestamp field and that the timestamp field mapping only has `type`, `meta` or `format` attributes configured. Other attributes can affect the guarantee that an index with this meta field mapper has a useable timestamp field. The MetadataCreateIndexService inserts a data stream timestamp field mapper whenever a new backing index of a data stream is created. Relates to #53100	2020-07-06 11:32:33 +02:00
Dan Hermann	7c43cbca82	[7.x] Ignore matching data streams if include_data_streams is false (#59028 )	2020-07-03 14:51:32 -05:00
Dan Hermann	c1781bc7e7	[7.x] Add include_data_streams flag for authorization (#59008 )	2020-07-03 12:58:39 -05:00
Dan Hermann	5e7746d3bd	[7.x] Mirror privileges over data streams to their backing indices (#58991 )	2020-07-03 06:33:38 -05:00
Lee Hinman	d3d03fc1c6	[7.x] Add default composable templates for new indexing strategy (#57629 ) (#58757 ) Backports the following commits to 7.x: Add default composable templates for new indexing strategy (#57629)	2020-07-01 09:32:32 -06:00
David Turner	822b7421ce	Forbid read-only-allow-delete block in blocks API (#58727 ) The read-only-allow-delete block is not really under the user's control since Elasticsearch adds/removes it automatically. This commit removes support for it from the new API for adding blocks to indices that was introduced in #58094.	2020-07-01 13:18:26 +01:00
Martijn van Groningen	a0df96befb	Add data stream support to put mapping and update index settings APIs. (#58758 ) Backport of #58231 to 7.x branch. Change update index setting and put mapping api to execute on all backing indices if data stream is targeted. Relates #53100	2020-07-01 13:32:21 +02:00
David Turner	3a234d2669	Account for remaining recovery in disk allocator (#58800 ) Today the disk-based shard allocator accounts for incoming shards by subtracting the estimated size of the incoming shard from the free space on the node. This is an overly conservative estimate if the incoming shard has almost finished its recovery since in that case it is already consuming most of the disk space it needs. This change adds to the shard stats a measure of how much larger each store is expected to grow, computed from the ongoing recovery, and uses this to account for the disk usage of incoming shards more accurately. Backport of #58029 to 7.x * Picky picky * Missing type	2020-07-01 10:12:44 +01:00
Lee Hinman	efa0686a6c	Fix template name in mapping composition yml test (#58788 ) The warning was copied from elsewhere and just needed to use the correct template and index name.	2020-06-30 17:04:15 -06:00
Dan Hermann	1c2a726731	Data stream support for search shards API (#58486 ) (#58765 )	2020-06-30 17:59:51 -05:00
Dan Hermann	cae49b0fd7	[7.x] Add data stream support to open index API (#58767 )	2020-06-30 14:30:32 -05:00
Dan Hermann	a84ff81743	Data stream support for get field mappings API (#58488 ) (#58766 )	2020-06-30 13:45:04 -05:00
Julie Tibshirani	ab65a57d70	Merge mappings for composable index templates (#58709 ) This PR implements recursive mapping merging for composable index templates. When creating an index, we perform the following: * Add each component template mapping in order, merging each one in after the last. * Merge in the index template mappings (if present). * Merge in the mappings on the index request itself (if present). Some principles: * All 'structural' changes are disallowed (but everything else is fine). An object mapper can never be changed between `type: object` and `type: nested`. A field mapper can never be changed to an object mapper, and vice versa. * Generally, each section is merged recursively. This includes `object` mappings, as well as root options like `dynamic_templates` and `meta`. Once we reach 'leaf components' like field definitions, they always overwrite an existing one instead of being merged. Relates to #53101.	2020-06-30 08:01:37 -07:00
Yannick Welsch	b885cbff1a	Add index block api (#58716 ) Adds an API for putting an index block in place, which also ensures for write blocks that, once successfully returning to the user, all shards of the index are properly accounting for the block, for example that all in-flight writes to an index have been completed after adding the write block. This API allows coordinating more complex workflows, where it is crucial that an index is no longer receiving writes after the API completes, useful for example when marking an index as read-only during an upgrade in order to reindex its documents.	2020-06-30 14:06:52 +02:00
Nik Everett	03e6d1b535	Add Variable Width Histogram Aggregation (backport of #42035 ) (#58440 ) Implements a new histogram aggregation called `variable_width_histogram` which dynamically determines bucket intervals based on document groupings. These groups are determined by running a one-pass clustering algorithm on each shard and then reducing each shard's clusters using an agglomerative clustering algorithm. This PR addresses #9572. The shard-level clustering is done in one pass to minimize memory overhead. The algorithm was lightly inspired by [this paper](https://ieeexplore.ieee.org/abstract/document/1198387). It fetches a small number of documents to sample the data and determine initial clusters. Subsequent documents are then placed into one of these clusters, or a new one if they are an outlier. This algorithm is described in more details in the aggregation's docs. At reduce time, a [hierarchical agglomerative clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) algorithm inspired by [this paper](https://arxiv.org/abs/1802.00304) continually merges the closest buckets from all shards (based on their centroids) until the target number of buckets is reached. The final values produced by this aggregation are approximate. Each bucket's min value is used as its key in the histogram. Furthermore, buckets are merged based on their centroids and not their bounds. So it is possible that adjacent buckets will overlap after reduction. Because each bucket's key is its min, this overlap is not shown in the final histogram. However, when such overlap occurs, we set the key of the bucket with the larger centroid to the midpoint between its minimum and the smaller bucket’s maximum: `min[large] = (min[large] + max[small]) / 2`. This heuristic is expected to increases the accuracy of the clustering. Nodes are unable to share centroids during the shard-level clustering phase. In the future, resolving https://github.com/elastic/elasticsearch/issues/50863 would let us solve this issue. It doesn’t make sense for this aggregation to support the `min_doc_count` parameter, since clusters are determined dynamically. The `order` parameter is not supported here to keep this large PR from becoming too complex. Co-authored-by: James Dorfman <jamesdorfman@users.noreply.github.com>	2020-06-25 11:40:47 -04:00
Rory Hunter	ebe1d9cdbe	Update rest-api-spec keyword list Follow-up to 35aecf4c9aa. Somehow I missed the fact that there's an ILM API named `retry`, which is a keyword in Ruby. I've removed it from the keywords list.	2020-06-25 09:55:13 +01:00
Rory Hunter	e413de4203	Validate that REST API names do not contain keywords (#58452 ) If an API name (or components of a name) overlaps with a reserved word in the programming language for an ES client, then it's possible that the code that is generated from the API will not compile. This PR adds validation to check for such overlaps.	2020-06-25 09:48:54 +01:00
Martijn van Groningen	f4fad9c65a	Re-enable data streams yaml tests in bwc mode (#58500 ) Backport of #58403 to 7.x branch.	2020-06-24 16:59:51 +02:00
Martijn van Groningen	7dda9934f9	Keep track of timestamp_field mapping as part of a data stream (#58400 ) Backporting #58096 to 7.x branch. Relates to #53100 * use mapping source direcly instead of using mapper service to extract the relevant mapping details * moved assertion to TimestampField class and added helper method for tests * Improved logic that inserts timestamp field mapping into an mapping. If the timestamp field path consisted out of object fields and if the final mapping did not contain the parent field then an error occurred, because the prior logic assumed that the object field existed.	2020-06-22 17:46:38 +02:00
Jim Ferenczi	82db0b575c	Allow index filtering in field capabilities API (#57276 ) (#58299 ) This change allows to use an `index_filter` in the field capabilities API. Indices are filtered from the response if the provided query rewrites to `match_none` on every shard: ```` GET metrics-* { "index_filter": { "bool": { "must": [ "range": { "@timestamp": { "gt": "2019" } } } } } ```` The filtering is done on a best-effort basis, it uses the can match phase to rewrite queries to `match_none` instead of fully executing the request. The first shard that can match the filter is used to create the field capabilities response for the entire index. Closes #56195	2020-06-18 10:23:26 +02:00
Rory Hunter	e065d6cc91	Rename dangling index APIs (#58266 ) The dangling_indices.import API name could cause issues in the client libs because import is a reserved word in many languages. Rename the API to avoid this, and rename the other APIs for consistency. Related to #48366.	2020-06-18 08:58:32 +01:00
Nik Everett	ab2c6d9696	Save memory when auto_date_histogram is not on top (backport of #57304 ) (#58190 ) This builds an `auto_date_histogram` aggregator that natively aggregates from many buckets and uses it when the `auto_date_histogram` used to use `asMultiBucketAggregator` which should save a significant amount of memory in those cases. In particular, this happens when `auto_date_histogram` is a sub-aggregator of a multi-bucketing aggregator like `terms` or `histogram` or `filters`. For the most part we preserve the original implementation when `auto_date_histogram` only collects from a single bucket. It isn't possible to "just port the aggregator" without taking a pretty significant performance hit because we used to rewrite all of the buckets every time we switched to a coarser and coarser rounding configuration. Without some major surgery to how to delay sub-aggs we'd end up rewriting the delay list zillions of time if there are many buckets. The multi-bucket version of the aggregator has a "budget" of "wasted" buckets and only rewrites all of the buckets when we exceed that budget. Now that we don't rebucket every time we increase the rounding we can no longer get an accurate count of the number of buckets! So instead the aggregator uses an estimate of the number of buckets to trigger switching to a coarser rounding. This estimate is likely to be terrible when buckets are far apart compared to the rounding. So it also uses the difference between the first and last bucket to trigger switching to a coarser rounding. Which covers for the shortcomings of the bucket estimation technique pretty well. It also causes the aggregator to emit fewer buckets in cases where they'd be reduced together on the coordinating node. This is wonderful! But probably fairly rare. All of that does buy us some speed improvements when the aggregator is a child of multi-bucket aggregator: Without metrics or time zone: 25% faster With metrics: 15% faster With time zone: 22% faster Relates to #56487	2020-06-17 08:48:41 -04:00
Rory Hunter	03369e0980	Implement dangling indices API (#58176 ) Backport of #50920. Part of #48366. Implement an API for listing, importing and deleting dangling indices. Co-authored-by: David Turner <david.turner@elastic.co>	2020-06-16 21:50:38 +01:00
Dan Hermann	911d46370e	Prohibit clone, shrink, and split on a data stream's write index	2020-06-16 10:53:20 -05:00
Dan Hermann	17f3318732	[7.x] Resolve index API (#58037 )	2020-06-12 15:41:32 -05:00
Nik Everett	5056f2792d	Skip max_buckets test when it is flaky (#58038 ) Before #57042 the max_buckets test would consistently pass because the request would consistently fail. In particular, the request would fail on the data node. After #57042 it only fails on the coordinating node. When the max_buckets test is run in a mixed version cluster it consistently fails on either the data node or the coordinating node. Except when the coordinating node is missing #43095. In that case if the one data node has #57042 and one does not, and the one that doesn't gets the request first, fails it as expected, and then the coordinating node retries the request on the node with #57042. When that happens the request fails mysteriously with "partial shard failures" as the error message but not partial failures reported. This is exactly the bug fixed in #43095. This updates the test to be skipped in mixed version clusters without #43095 because they sometimes fail the test spuriously. The request fails in those cases, just like we expect, but with a mysterious error message. Closes #57657	2020-06-12 15:06:56 -04:00
Mayya Sharipova	8bd0147ba7	Correct how meta-field is defined for pre 7.8 hits (#57951 ) We keep a static list of meta-fields: META_FIELDS_BEFORE_7_8 as it was before. This is done to ensure the backwards compatability with pre 7.8 nodes. Closes #57831	2020-06-12 09:39:53 -04:00
Martijn van Groningen	01d8bb8cfa	Enforce valid field mapping exists for timestamp_field in templates. (#58036 ) Backport of #57741 to 7.x branch. Relates to #53100	2020-06-12 15:24:42 +02:00
Martijn van Groningen	f4199f2ee0	Prohibit append-only writes targeting backing indices directly. (#58025 ) Backport of #57788 to 7.x branch. Append-only writes can only target the corresponding data stream. Relates to #53100	2020-06-12 13:17:55 +02:00
Russ Cam	f51f9b19c7	Mark Component and Index template APIs as experimental (#57910 ) This commit marks the Component Template and Index Template APIs as experimental. (cherry picked from commit a85f2bede8eb632e3837ac7630f8dfdf46da6b52)	2020-06-10 14:07:09 +10:00
Dan Hermann	b501b282f8	Change default backing index naming scheme	2020-06-09 09:31:34 -05:00
Lee Hinman	6e8cf0973f	[7.x] Disallow merging existing mapping field definitions in templates (#57701 ) (#57822 ) Backports the following commits to 7.x: Disallow merging existing mapping field definitions in templates (#57701)	2020-06-08 12:56:09 -06:00
Benjamin Trent	16fcb64c99	Test mute (#57832 )	2020-06-08 14:03:16 -04:00
Nik Everett	ee0ce8ffaf	Fix a bug with missing fields in sig_terms (#57757 ) When you run a `significant_terms` aggregation on a field and it is mapped but there aren't any values for it then the count of the documents that match the query on that shard still have to be added to the overall doc count. I broke that in #57361. This fixes that. Closes #57402	2020-06-08 10:07:14 -04:00
Dan Hermann	3fe93e24a6	[7.x] Prohibit closing the write index for a data stream (#57740 )	2020-06-05 11:14:43 -05:00
Nik Everett	94b3eed6be	Re-mute test Tracked in #57402	2020-06-05 10:52:24 -04:00
Nik Everett	de27253d87	Drop skip on test after backporting fix Fixed in `98c379c507`. Closes #57402	2020-06-04 16:04:18 -04:00
Nik Everett	98c379c507	Merge remaining sig_terms into terms (#57397 ) (#57687 ) Merges the remaining implementation of `significant_terms` into `terms` so that we can more easilly make them work properly without `asMultiBucketAggregator` which should save memory and speed them up. Relates #56487	2020-06-04 14:32:32 -04:00
Nik Everett	97c06816a4	Fix an optimization in terms agg (backport #57438 ) (#57547 ) When the `terms` agg runs against strings and uses global ordinals it has an optimization when it collects segments that only ever have a single value for the particular string. This is very common. But I broke it in #57241. This fixes that optimization and adds `debug` information that you can use to see how often we collect segments of each type. And adds a test to make sure that I don't break the optimization again. We also had a specialiation for when there isn't a filter on the terms to aggregate. I had removed that specialization in #57241 which resulted in some slow down as well. This adds it back but in a more clear way. And, hopefully, a way that is marginally faster when there is a filter. Closes #57407	2020-06-02 14:57:45 -04:00
David Kyle	4d54bb3917	Correct expected warning in indices.create yml tests (#57409 ) v2 index is now composable, v1 is now legacy	2020-06-01 19:22:30 +01:00
David Kyle	82f27ef128	Mute msearch yml test (#57406 ) For #57402	2020-06-01 13:02:46 +01:00
Nik Everett	4263c25b2f	Save memory when histogram agg is not on top (backport of #57277 ) (#57377 ) This saves some memory when the `histogram` aggregation is not a top level aggregation by dropping `asMultiBucketAggregator` in favor of natively implementing multi-bucket storage in the aggregator. For the most part this just uses the `LongKeyedBucketOrds` that we built the first time we did this.	2020-05-29 15:07:37 -04:00
Martijn van Groningen	d8928b3f48	fixed allowed warnings in yaml test	2020-05-29 13:37:49 +02:00
Martijn van Groningen	04ef39da77	Change cluster info actions to be able to resolve data streams. (#57343 ) Backport of #56878 to 7.x branch. With this change the following APIs will be able to resolve data streams: get index, get mappings and ilm explain APIs. Relates to #53100	2020-05-29 12:17:53 +02:00
Russ Cam	2a9073d4c1	Deprecate local param in get_mapping.json (#57265 ) Relates: elastic/elasticsearch#55014 This commit deprecates the local param in get_mapping.json. This parameter is a no-op and field mappings are always retrieved locally. (cherry picked from commit 0b041cccd894f01d723fb2979f70c1cf279700a6)	2020-05-29 12:25:41 +10:00
Nik Everett	b9fe10866e	Make global ords terms simpler to understand (backport of #57241 ) (#57311 ) When the `terms` enum operates on non-numeric data it can collect it via global ordinals. It actually has two separate collection strategies for, one "dense" and one "remapping". Each of those strategies has two "iteration" strategies that it uses to build buckets, depending on whether or not we need buckets with `0` docs in them. Previously this was done with several `null` checks and never really explained. This change replaces those checks with two `CollectionStrategy` classes which have good stuff like documentation.	2020-05-28 16:52:35 -04:00
Martijn van Groningen	225ccd1cfa	Ensure template exists when creating data stream (#57275 ) Backporting #56888 to 7.x branch. Limit the creation of data streams only for namespaces that have a composable template with a data stream definition. This way we ensure that mappings/settings have been specified and will be used at data stream creation and data stream rollover. Also remove `timestamp_field` parameter from create data stream request and let the create data stream api resolve the timestamp field from the data stream definition snippet inside a composable template. Relates to #53100	2020-05-28 15:08:25 +02:00
Dan Hermann	2738998ebb	Limit _cat/indices test to versions with fix (#57244 ) (#57256 )	2020-05-27 16:57:24 -05:00
Lee Hinman	c0f732b9f6	[7.x] Rename template V2 classes to ComposableTemplate (#57183 ) (#57232 ) Backports the following commits to 7.x: Rename template V2 classes to ComposableTemplate (#57183)	2020-05-27 11:01:59 -06:00

1 2 3 4 5 ...

2140 Commits