OpenSearch/docs/reference/release-notes
Nik Everett b99a50bcb9
value_count Aggregation optimization (backport of #54854) (#55076)
We found a performance problem during testing.

Data: 200Million docs, 1 shard, 0 replica

    hits    |   avg   |   sum   | value_count |
----------- | ------- | ------- | ----------- |
     20,000 |   .038s |   .033s |       .063s |
    200,000 |   .127s |   .125s |       .334s |
  2,000,000 |   .789s |   .729s |      3.176s |
 20,000,000 |  4.200s |  3.239s |     22.787s |
200,000,000 | 21.000s | 22.000s |    154.917s |

The performance of `avg`, `sum`, and similar metric aggregations is very
close, but `value_count` has always been much slower: roughly 5 to 7
times slower than `avg` and `sum` at 20 million hits and above.
Intuitively, `value_count` and `sum` are similar operations and should
take about the same time. We therefore looked into how the
`value_count` aggregation is implemented.

Elasticsearch counts values by traversing the field of each document.
If the field holds a single value, the count is increased by 1. If it
holds an array, the count is increased by n, the number of elements.
The problem lies in reading the field for every document, which turns
the on-disk value into a Java object. We summarize the problems in the
current Elasticsearch implementation as:

- Numbers are cast to strings, which adds conversion overhead and GC
  pressure from the large number of strings created
- After a number is converted to a string, sorting and other
  unnecessary operations are performed
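
The counting rule described above can be sketched as follows. This is a
minimal illustration, not the Elasticsearch implementation: the class
`ValueCountPrinciple` and the modeling of each document's field as a
`double[]` are assumptions made for the example.

```java
// Hypothetical sketch of the counting rule; not Elasticsearch code.
final class ValueCountPrinciple {

    /**
     * Counts field values across documents. Each document is modeled as
     * the array of values its field holds (an assumption for this sketch).
     */
    static long countValues(double[][] docs) {
        long count = 0;
        for (double[] fieldValues : docs) {
            // A single value adds 1, an array adds n, a missing field adds 0.
            count += fieldValues.length;
        }
        return count;
    }

    public static void main(String[] args) {
        double[][] docs = {
            {3.5},        // ordinary single value
            {1, 2, 3},    // array of three values
            {}            // document without the field
        };
        System.out.println(countValues(docs));
    }
}
```

Note that the rule only needs the number of values per document, never
their contents, which is why converting each value to a string is
avoidable work.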

Here is evidence of the type conversion overhead.

```
// JDK source of Long.toString(long); the getChars call is the expensive part.
public static String toString(long i) {
    int size = stringSize(i);
    if (COMPACT_STRINGS) {
        byte[] buf = new byte[size];
        getChars(i, size, buf);
        return new String(buf, LATIN1);
    } else {
        byte[] buf = new byte[size * 2];
        StringUTF16.getChars(i, size, buf);
        return new String(buf, UTF16);
    }
}
```

  test type  | average |  min |   max   |  sum
------------ | ------- | ---- | ------- | ------
double->long |  32.2ns | 28ns | 0.024ms |  3.22s
long->double |  31.9ns | 28ns | 0.036ms |  3.19s
long->String | 163.8ns | 93ns |  1921ms | 16.3s

The overhead of the long->String conversion is particularly serious.
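
A comparison of this kind can be approximated with a rough timing
harness like the one below. It is only a sketch: the class
`ConversionCostSketch`, the input size, and the warm-up count are
assumptions, it is not the harness that produced the table above, and
plain `System.nanoTime` loops are sensitive to JIT warm-up and GC, so
the absolute numbers will differ.

```java
import java.util.Random;

// Hypothetical micro-benchmark sketch; not the harness behind the table above.
final class ConversionCostSketch {

    /** Converts every long to double and sums the results so the work is not optimized away. */
    static double sumAsDouble(long[] values) {
        double sink = 0;
        for (long v : values) {
            sink += (double) v;
        }
        return sink;
    }

    /** Converts every long to a String (one allocation per value) and sums the lengths. */
    static long sumStringLengths(long[] values) {
        long sink = 0;
        for (long v : values) {
            sink += Long.toString(v).length();
        }
        return sink;
    }

    public static void main(String[] args) {
        long[] values = new long[1_000_000];
        Random rnd = new Random(42);
        for (int i = 0; i < values.length; i++) {
            values[i] = rnd.nextLong();
        }

        // Warm up both paths so the JIT compiles them before timing.
        for (int round = 0; round < 5; round++) {
            sumAsDouble(values);
            sumStringLengths(values);
        }

        long t0 = System.nanoTime();
        double a = sumAsDouble(values);
        long t1 = System.nanoTime();
        long b = sumStringLengths(values);
        long t2 = System.nanoTime();

        System.out.printf("long->double: %.1f ns/value (sink=%.0f)%n",
                (t1 - t0) / (double) values.length, a);
        System.out.printf("long->String: %.1f ns/value (sink=%d)%n",
                (t2 - t1) / (double) values.length, b);
    }
}
```

For trustworthy numbers, a dedicated benchmarking framework is a better
choice than a hand-written loop.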

Our optimization is actually very simple: handle different types
separately instead of converting everything to string for uniform
processing. We added type identification in ValueCountAggregator and
special-cased the number and geo_point types to skip their type
conversion. Because far fewer strings are created, the improvement is
very noticeable.
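
The idea of counting per type without going through strings can be
sketched as below. This is an illustration under stated assumptions, not
the actual ValueCountAggregator code: `TypedValueCountSketch`,
`DocValueCounter`, and the array-based field models are hypothetical
names introduced for the example.

```java
// Hypothetical sketch of per-type counting; not the real ValueCountAggregator.
final class TypedValueCountSketch {

    /** Per-type view of a field: how many values does document `doc` hold? */
    interface DocValueCounter {
        int count(int doc);
    }

    /** Numeric fields: the count is read directly, no number is turned into a String. */
    static DocValueCounter forNumeric(double[][] values) {
        return doc -> values[doc].length;
    }

    /** Geo points (modeled here as {lat, lon} pairs): counted directly as well. */
    static DocValueCounter forGeoPoint(double[][][] points) {
        return doc -> points[doc].length;
    }

    /** String-like fields keep a string-based view, since they are strings already. */
    static DocValueCounter forKeyword(String[][] terms) {
        return doc -> terms[doc].length;
    }

    /** Sums the per-document counts over documents 0..maxDoc-1. */
    static long valueCount(DocValueCounter counter, int maxDoc) {
        long total = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            total += counter.count(doc);
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] prices = {{9.5}, {1.0, 2.0}, {}};
        double[][][] locations = {{{52.5, 13.4}}, {{48.8, 2.3}, {40.7, -74.0}}};
        String[][] tags = {{"a", "b"}, {"c"}};

        System.out.println(valueCount(forNumeric(prices), prices.length));
        System.out.println(valueCount(forGeoPoint(locations), locations.length));
        System.out.println(valueCount(forKeyword(tags), tags.length));
    }
}
```

The type is identified once, when the counter is chosen, so the
per-document loop does no conversion at all for numeric and geo_point
fields.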

    hits    |   avg   |   sum   | value_count | value_count | value_count | value_count | value_count | value_count |
            |         |         |    double   |    double   |   keyword   |   keyword   |  geo_point  |  geo_point  |
            |         |         |   before    |    after    |   before    |    after    |   before    |    after    |
----------- | ------- | ------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
     20,000 |   .038s |   .033s |       .063s |       .026s |       .030s |       .030s |       .038s |       .015s |
    200,000 |   .127s |   .125s |       .334s |       .078s |       .116s |       .099s |       .278s |       .031s |
  2,000,000 |   .789s |   .729s |      3.176s |       .439s |       .348s |       .386s |      3.365s |       .178s |
 20,000,000 |  4.200s |  3.239s |     22.787s |      2.700s |      2.500s |      2.600s |     25.192s |      1.278s |
200,000,000 | 21.000s | 22.000s |    154.917s |     18.990s |     19.000s |     20.000s |    168.971s |      9.093s |

- The results are more in line with common sense. `value_count` now
  takes about the same time as `avg` and `sum`, or even less.
  Previously, `value_count` was much slower than `avg` and `sum`, by
  roughly 7 times at 200 million hits.
- For numeric types such as `double` and `long`, performance improves
  by about 8 to 9 times; for the `geo_point` type, it improves by 18 to
  20 times.
2020-04-10 13:16:39 -04:00
..
7.0.0-alpha1.asciidoc Generate release notes for 7.0.0-alpha1. (#39899) 2019-03-12 08:10:18 +01:00
7.0.0-alpha2.asciidoc Rename MetaData to Metadata in all of the places (#54519) 2020-03-31 17:24:38 -04:00
7.0.0-beta1.asciidoc Rename MetaData to Metadata in all of the places (#54519) 2020-03-31 17:24:38 -04:00
7.0.0-rc1.asciidoc [DOCS] Removes coming tags 2019-05-28 08:58:41 -07:00
7.0.0-rc2.asciidoc release notes for 7.0.0-rc2 (#40796) 2019-04-03 11:48:59 -05:00
7.0.asciidoc Rename MetaData to Metadata in all of the places (#54519) 2020-03-31 17:24:38 -04:00
7.1.asciidoc [Doc] migration guide joda (#51986) 2020-03-23 08:29:01 +01:00
7.2.asciidoc [Doc] migration guide joda (#51986) 2020-03-23 08:29:01 +01:00
7.3.asciidoc Rename MetaData to Metadata in all of the places (#54519) 2020-03-31 17:24:38 -04:00
7.4.asciidoc [Doc] migration guide joda (#51986) 2020-03-23 08:29:01 +01:00
7.5.asciidoc [Doc] migration guide joda (#51986) 2020-03-23 08:29:01 +01:00
7.6.asciidoc [DOCS] Clarifies API key breaking change (#54522) 2020-04-01 08:58:15 -07:00
7.7.asciidoc Add breaking change note for #53669 2020-03-25 09:31:14 -04:00
7.8.asciidoc value_count Aggregation optimization (backport of #54854) (#55076) 2020-04-10 13:16:39 -04:00
highlights-7.0.0.asciidoc [7.x][DOCS] Updates ML links (#50387) (#50409) 2019-12-20 10:01:19 -08:00
highlights-7.1.0.asciidoc [7.x][DOCS] Updates ML links (#50387) (#50409) 2019-12-20 10:01:19 -08:00
highlights-7.2.0.asciidoc [7.x][DOCS] Updates ML links (#50387) (#50409) 2019-12-20 10:01:19 -08:00
highlights-7.3.0.asciidoc [7.x][DOCS] Updates ML links (#50387) (#50409) 2019-12-20 10:01:19 -08:00
highlights-7.4.0.asciidoc [7.x][DOCS] Updates ML links (#50387) (#50409) 2019-12-20 10:01:19 -08:00
highlights-7.5.0.asciidoc [DOCS] Correct `shape` field release in 7.5 release highlights (#54631) 2020-04-02 09:19:40 -04:00
highlights-7.6.0.asciidoc [DOCS] Adds machine learning highlights (#52334) 2020-02-14 08:51:55 -08:00
highlights-7.7.0.asciidoc [DOCS] Adds release highlight for transforms (#51555) 2020-01-29 08:35:02 -08:00
highlights-7.8.0.asciidoc [DOCS] Adds release highlights placeholder 2020-04-01 09:22:20 -07:00
highlights.asciidoc [DOCS] Adds release highlights placeholder 2020-04-01 09:22:20 -07:00