mirror of
https://github.com/honeymoose/OpenSearch.git
synced 2025-02-17 10:25:15 +00:00
Implements a new histogram aggregation called `variable_width_histogram` which dynamically determines bucket intervals based on document groupings. These groups are determined by running a one-pass clustering algorithm on each shard and then reducing each shard's clusters using an agglomerative clustering algorithm. This PR addresses #9572. The shard-level clustering is done in one pass to minimize memory overhead. The algorithm was lightly inspired by [this paper](https://ieeexplore.ieee.org/abstract/document/1198387). It fetches a small number of documents to sample the data and determine initial clusters. Subsequent documents are then placed into one of these clusters, or a new one if they are an outlier. This algorithm is described in more details in the aggregation's docs. At reduce time, a [hierarchical agglomerative clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) algorithm inspired by [this paper](https://arxiv.org/abs/1802.00304) continually merges the closest buckets from all shards (based on their centroids) until the target number of buckets is reached. The final values produced by this aggregation are approximate. Each bucket's min value is used as its key in the histogram. Furthermore, buckets are merged based on their centroids and not their bounds. So it is possible that adjacent buckets will overlap after reduction. Because each bucket's key is its min, this overlap is not shown in the final histogram. However, when such overlap occurs, we set the key of the bucket with the larger centroid to the midpoint between its minimum and the smaller bucket’s maximum: `min[large] = (min[large] + max[small]) / 2`. This heuristic is expected to increases the accuracy of the clustering. Nodes are unable to share centroids during the shard-level clustering phase. In the future, resolving https://github.com/elastic/elasticsearch/issues/50863 would let us solve this issue. It doesn’t make sense for this aggregation to support the `min_doc_count` parameter, since clusters are determined dynamically. The `order` parameter is not supported here to keep this large PR from becoming too complex. Co-authored-by: James Dorfman <jamesdorfman@users.noreply.github.com>
The Elasticsearch docs are in AsciiDoc format and can be built using the Elasticsearch documentation build process. See: https://github.com/elastic/docs === Backporting doc fixes * Doc changes should generally be made against master and backported through to the current version (as applicable). * Changes can also be backported to the maintenance version of the previous major version. This is typically reserved for technical corrections, as it can require resolving more complex merge conflicts, fixing test failures, and figuring out where to apply the change. * Avoid backporting to out-of-maintenance versions. Docs follow the same policy as code and fixes are not ordinarily merged to versions that are out of maintenance. * Do not backport doc changes to https://www.elastic.co/support/eol[EOL versions]. === Snippet testing Snippets marked with `[source,console]` are automatically annotated with "VIEW IN CONSOLE" and "COPY AS CURL" in the documentation and are automatically tested by the command `./gradlew -pdocs check`. To test just the docs from a single page, use e.g. `./gradlew -pdocs integTestRunner --tests "\*rollover*"`. By default each `[source,console]` snippet runs as its own isolated test. You can manipulate the test execution in the following ways: * `// TEST`: Explicitly marks a snippet as a test. Snippets marked this way are tests even if they don't have `[source,console]` but usually `// TEST` is used for its modifiers: * `// TEST[s/foo/bar/]`: Replace `foo` with `bar` in the generated test. This should be used sparingly because it makes the snippet "lie". Sometimes, though, you can use it to make the snippet more clear. Keep in mind that if there are multiple substitutions then they are applied in the order that they are defined. * `// TEST[catch:foo]`: Used to expect errors in the requests. Replace `foo` with `request` to expect a 400 error, for example. If the snippet contains multiple requests then only the last request will expect the error. * `// TEST[continued]`: Continue the test started in the last snippet. Between tests the nodes are cleaned: indexes are removed, etc. This prevents that from happening between snippets because the two snippets are a single test. This is most useful when you have text and snippets that work together to tell the story of some use case because it merges the snippets (and thus the use case) into one big test. * You can't use `// TEST[continued]` immediately after `// TESTSETUP` or `// TEARDOWN`. * `// TEST[skip:reason]`: Skip this test. Replace `reason` with the actual reason to skip the test. Snippets without `// TEST` or `// CONSOLE` aren't considered tests anyway but this is useful for explicitly documenting the reason why the test shouldn't be run. * `// TEST[setup:name]`: Run some setup code before running the snippet. This is useful for creating and populating indexes used in the snippet. The setup code is defined in `docs/build.gradle`. See `// TESTSETUP` below for a similar feature. * `// TEST[warning:some warning]`: Expect the response to include a `Warning` header. If the response doesn't include a `Warning` header with the exact text then the test fails. If the response includes `Warning` headers that aren't expected then the test fails. * `[source,console-result]`: Matches this snippet against the body of the response of the last test. If the response is JSON then order is ignored. If you add `// TEST[continued]` to the snippet after `[source,console-result]` it will continue in the same test, allowing you to interleave requests with responses to check. * `// TESTRESPONSE`: Explicitly marks a snippet as a test response even without `[source,console-result]`. Similarly to `// TEST` this is mostly used for its modifiers. * You can't use `[source,console-result]` immediately after `// TESTSETUP`. Instead, consider using `// TEST[continued]` or rearrange your snippets. NOTE: Previously we only used `// TESTRESPONSE` instead of `[source,console-result]` so you'll see that a lot in older branches but we prefer `[source,console-result]` now. * `// TESTRESPONSE[s/foo/bar/]`: Substitutions. See `// TEST[s/foo/bar]` for how it works. These are much more common than `// TEST[s/foo/bar]` because they are useful for eliding portions of the response that are not pertinent to the documentation. * One interesting difference here is that you often want to match against the response from Elasticsearch. To do that you can reference the "body" of the response like this: `// TESTRESPONSE[s/"took": 25/"took": $body.took/]`. Note the `$body` string. This says "I don't expect that 25 number in the response, just match against what is in the response." Instead of writing the path into the response after `$body` you can write `$_path` which "figures out" the path. This is especially useful for making sweeping assertions like "I made up all the numbers in this example, don't compare them" which looks like `// TESTRESPONSE[s/\d+/$body.$_path/]`. * `// TESTRESPONSE[non_json]`: Add substitutions for testing responses in a format other than JSON. Use this after all other substitutions so it doesn't make other substitutions difficult. * `// TESTRESPONSE[skip:reason]`: Skip the assertions specified by this response. * `// TESTSETUP`: Marks this snippet as the "setup" for all other snippets in this file. This is a somewhat natural way of structuring documentation. You say "this is the data we use to explain this feature" then you add the snippet that you mark `// TESTSETUP` and then every snippet will turn into a test that runs the setup snippet first. See the "painless" docs for a file that puts this to good use. This is fairly similar to `// TEST[setup:name]` but rather than the setup defined in `docs/build.gradle` the setup is defined right in the documentation file. In general, we should prefer `// TESTSETUP` over `// TEST[setup:name]` because it makes it more clear what steps have to be taken before the examples will work. Tip: `// TESTSETUP` can only be used on the first snippet of a document. * `// TEARDOWN`: Ends and cleans up a test series started with `// TESTSETUP` or `// TEST[setup:name]`. You can use `// TEARDOWN` to set up multiple tests in the same file. * `// NOTCONSOLE`: Marks this snippet as neither `// CONSOLE` nor `// TESTRESPONSE`, excluding it from the list of unconverted snippets. We should only use this for snippets that *are* JSON but are *not* responses or requests. In addition to the standard CONSOLE syntax these snippets can contain blocks of yaml surrounded by markers like this: ``` startyaml - compare_analyzers: {index: thai_example, first: thai, second: rebuilt_thai} endyaml ``` This allows slightly more expressive testing of the snippets. Since that syntax is not supported by `[source,console]` the usual way to incorporate it is with a `// TEST[s//]` marker like this: ``` // TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: thai_example, first: thai, second: rebuilt_thai}\nendyaml\n/] ``` Any place you can use json you can use elements like `$body.path.to.thing` which is replaced on the fly with the contents of the thing at `path.to.thing` in the last response.