OpenSearch/rest-api-spec
Nik Everett 03e6d1b535
Add Variable Width Histogram Aggregation (backport of #42035) (#58440)
Implements a new histogram aggregation called `variable_width_histogram` which
dynamically determines bucket intervals based on document groupings. These
groups are determined by running a one-pass clustering algorithm on each shard
and then reducing each shard's clusters using an agglomerative
clustering algorithm.

This PR addresses #9572.

The shard-level clustering is done in one pass to minimize memory overhead. The
algorithm was lightly inspired by
[this paper](https://ieeexplore.ieee.org/abstract/document/1198387). It fetches
a small number of documents to sample the data and determine initial clusters.
Subsequent documents are then placed into one of these clusters, or a new one
if they are an outlier. This algorithm is described in more details in the
aggregation's docs.

At reduce time, a
[hierarchical agglomerative clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering)
algorithm inspired by [this paper](https://arxiv.org/abs/1802.00304)
continually merges the closest buckets from all shards (based on their
centroids) until the target number of buckets is reached.

The final values produced by this aggregation are approximate. Each bucket's
min value is used as its key in the histogram. Furthermore, buckets are merged
based on their centroids and not their bounds. So it is possible that adjacent
buckets will overlap after reduction. Because each bucket's key is its min,
this overlap is not shown in the final histogram. However, when such overlap
occurs, we set the key of the bucket with the larger centroid to the midpoint
between its minimum and the smaller bucket’s maximum:
`min[large] = (min[large] + max[small]) / 2`. This heuristic is expected to
increases the accuracy of the clustering.

Nodes are unable to share centroids during the shard-level clustering phase. In
the future, resolving https://github.com/elastic/elasticsearch/issues/50863
would let us solve this issue.

It doesn’t make sense for this aggregation to support the `min_doc_count`
parameter, since clusters are determined dynamically. The `order` parameter is
not supported here to keep this large PR from becoming too complex.

Co-authored-by: James Dorfman <jamesdorfman@users.noreply.github.com>
2020-06-25 11:40:47 -04:00
..
src/main/resources Add Variable Width Histogram Aggregation (backport of #42035) (#58440) 2020-06-25 11:40:47 -04:00
.gitignore Initial commit (blank repository) 2013-05-23 17:56:22 +02:00
README.markdown [DOCS] Note doc links should be live in REST API JSON specs (#53871) 2020-03-23 07:42:03 -04:00
build.gradle 7.x only REST specification fixes (#56736) 2020-05-26 12:33:57 -05:00
keywords.json Update rest-api-spec keyword list 2020-06-25 09:55:13 +01:00

README.markdown

Elasticsearch REST API JSON specification

This repository contains a collection of JSON files which describe the Elasticsearch HTTP API.

Their purpose is to formalize and standardize the API, to facilitate development of libraries and integrations.

Example for the "Create Index" API:

{
  "indices.create": {
    "documentation":{
      "url":"https://www.elastic.co/guide/en/elasticsearch/reference/master/indices-create-index.html"
    },
    "stability": "stable",
    "url":{
      "paths":[
        {
          "path":"/{index}",
          "method":"PUT",
          "parts":{
            "index":{
              "type":"string",
              "description":"The name of the index"
            }
          }
        }
      ]
    },
    "params": {
      "timeout": {
        "type" : "time",
        "description" : "Explicit operation timeout"
      }
    },
    "body": {
      "description" : "The configuration for the index (`settings` and `mappings`)"
    }
  }
}

The specification contains:

  • The name of the API (indices.create), which usually corresponds to the client calls

  • Link to the documentation at the http://elastic.co website.

    IMPORANT: This should be a live link. Several downstream ES clients use this link to generate their documentation. Using a broken link or linking to yet-to-be-created doc pages can break the Elastic docs build.

  • stability indicating the state of the API, has to be declared explicitly or YAML tests will fail

    • experimental highly likely to break in the near future (minor/path), no bwc guarantees. Possibly removed in the future.
    • beta less likely to break or be removed but still reserve the right to do so
    • stable No backwards breaking changes in a minor
  • Request URL: HTTP method, path and parts

  • Request parameters

  • Request body specification

NOTE If an API is stable but it response should be treated as an arbitrary map of key values please notate this as followed

{
  "api.name": {
    "stability" : "stable",
    "response": {
      "treat_json_as_key_value" : true
    }
  }
}

Backwards compatibility

The specification follows the same backward compatibility guarantees as Elasticsearch.

  • Within a Major, additions only.
  • If an item has been documented wrong it should be deprecated instead as removing these might break downstream clients.
  • Major version change, may deprecate pieces or simply remove them given enough deprecation time.

Deprecations

The specification schema allows to codify API deprecations, either for an entire API, or for specific parts of the API, such as paths or parameters.

Entire API:

{
  "api" : {
    "deprecated" : {
      "version" : "7.0.0",
      "description" : "Reason API is being deprecated"
    },
  }
}

Specific paths and their parts:

{
  "api": {
    "url": {
      "paths": [
        {
          "path":"/{index}/{type}/{id}/_create",
          "method":"PUT",
          "parts":{
            "id":{
              "type":"string",
              "description":"Document ID"
            },
            "index":{
              "type":"string",
              "description":"The name of the index"
            },
            "type":{
              "type":"string",
              "description":"The type of the document",
              "deprecated":true
            }
          },
          "deprecated":{
            "version":"7.0.0",
            "description":"Specifying types in urls has been deprecated"
          }
        }
      ]
    }
  }
}

Parameters

{
  "api": {
    "url": {
      "params": {
        "stored_fields": {
          "type": "list",
          "description" : "",
          "deprecated" : {
            "version" : "7.0.0",
            "description" : "Reason parameter is being deprecated"
          }
        }
      }
    }
  }
}

License

This software is licensed under the Apache License, version 2 ("ALv2").