OpenSearch

mirror of https://github.com/honeymoose/OpenSearch.git synced 2025-03-09 14:34:43 +00:00

value_count Aggregation optimization (backport of #54854) (#55076)

We found a performance problem during testing.

Data: 200 million docs, 1 shard, 0 replicas

    hits    |   avg   |   sum   | value_count |
----------- | ------- | ------- | ----------- |
     20,000 |   .038s |   .033s |       .063s |
    200,000 |   .127s |   .125s |       .334s |
  2,000,000 |   .789s |   .729s |      3.176s |
 20,000,000 |  4.200s |  3.239s |     22.787s |
200,000,000 | 21.000s | 22.000s |    154.917s |

The performance of `avg`, `sum` and other metric aggregations is very
close, but `value_count` has always been much slower: about 5 to 7 times
slower than `avg` and `sum` at 20,000,000 hits and above. Since
`value_count` and `sum` are similar operations, we expected them to take
about the same time. Therefore, we investigated the `value_count`
aggregation.

Counting in Elasticsearch works by traversing the field of each
document. If the field holds a single value, the count is increased
by 1. If it holds an array, the count is increased by n. The problem
lies in how each document's field is read: the value is loaded from disk
and turned into a Java object. We summarize the current problems in
Elasticsearch as:

- Numbers are converted to strings, which adds conversion overhead and
  GC pressure from the large number of strings created
- After a number is converted to a string, sorting and other
  unnecessary operations are performed

Here is the proof of type conversion overhead.

```
// JDK source of Long.toString(long); getChars is very time-consuming.
public static String toString(long i) {
    int size = stringSize(i);
    if (COMPACT_STRINGS) {
        byte[] buf = new byte[size];
        getChars(i, size, buf);
        return new String(buf, LATIN1);
    } else {
        byte[] buf = new byte[size * 2];
        StringUTF16.getChars(i, size, buf);
        return new String(buf, UTF16);
    }
}
```

  test type  | average |  min |    max    |  sum
------------ | ------- | ---- | --------- | ------
double->long |  32.2ns | 28ns |   0.024ms |  3.22s
long->double |  31.9ns | 28ns |   0.036ms |  3.19s
long->String | 163.8ns | 93ns |    1921ms | 16.3s

The overhead of the `long->String` conversion is particularly serious.
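A comparison like the one in the table above can be approximated with a small timing loop. This is only a rough sketch, not the harness used for this commit: the class name `ConversionBench`, the helper `averageNanos`, and the iteration count are all made up for illustration, and a serious measurement would use a benchmark framework such as JMH.

```java
import java.util.function.LongConsumer;

class ConversionBench {
    // Sinks keep the JIT from discarding the conversions as dead code.
    static long longSink;
    static double doubleSink;

    /** Runs op over the values 0..iterations-1 and returns average nanoseconds per call. */
    static double averageNanos(int iterations, LongConsumer op) {
        long start = System.nanoTime();
        for (long i = 0; i < iterations; i++) {
            op.accept(i);
        }
        return (System.nanoTime() - start) / (double) iterations;
    }

    public static void main(String[] args) {
        int n = 5_000_000; // hypothetical iteration count, chosen only for this sketch

        // Warm-up pass so the timed passes below mostly measure compiled code.
        averageNanos(n, v -> longSink += Long.toString(v).length());

        double toDouble = averageNanos(n, v -> doubleSink += (double) v);
        double toLong = averageNanos(n, v -> longSink += (long) (v + 0.5));
        double toString = averageNanos(n, v -> longSink += Long.toString(v).length());

        System.out.printf("long->double: %.1f ns%n", toDouble);
        System.out.printf("double->long: %.1f ns%n", toLong);
        System.out.printf("long->String: %.1f ns%n", toString);
    }
}
```

Absolute numbers will differ by machine and JVM; only the relative gap between the numeric casts and the string conversion is of interest here.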

Our optimization is actually very simple: handle the different types
separately instead of uniformly converting everything to strings. We
added type identification in `ValueCountAggregator` and special-cased
the number and geo_point types so that they skip the type conversion.
Because far fewer strings are created, the improvement is very
noticeable.
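The idea can be sketched in plain Java. This is a simplified model under stated assumptions, not the actual `ValueCountAggregator` code: `Kind`, `DocValues`, `toSortedStrings`, `countViaStrings` and `countByKind` are hypothetical names, and real doc values are not boxed `Object` arrays.

```java
import java.util.Arrays;
import java.util.List;

class ValueCountSketch {
    /** Hypothetical stand-ins for the kinds of field a value source can expose. */
    enum Kind { NUMERIC, GEO_POINT, BYTES }

    /** One document's values for a field: a scalar has one entry, an array has n. */
    static final class DocValues {
        final Kind kind;
        final Object[] values;

        DocValues(Kind kind, Object... values) {
            this.kind = kind;
            this.values = values;
        }
    }

    /** Generic string view: converts every value to a String and sorts the result. */
    static String[] toSortedStrings(Object[] values) {
        String[] text = new String[values.length];
        for (int i = 0; i < values.length; i++) {
            text[i] = String.valueOf(values[i]); // one allocation per value
        }
        Arrays.sort(text);
        return text;
    }

    /** Simplified "before": every field is read through the string view, whatever its kind. */
    static long countViaStrings(List<DocValues> docs) {
        long count = 0;
        for (DocValues doc : docs) {
            count += toSortedStrings(doc.values).length;
        }
        return count;
    }

    /** Simplified "after": numeric and geo values are counted without any conversion. */
    static long countByKind(List<DocValues> docs) {
        long count = 0;
        for (DocValues doc : docs) {
            switch (doc.kind) {
                case NUMERIC:
                case GEO_POINT:
                    count += doc.values.length; // +1 for a scalar, +n for an array
                    break;
                default:
                    count += toSortedStrings(doc.values).length; // generic path is kept
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<DocValues> docs = Arrays.asList(
            new DocValues(Kind.NUMERIC, 1.5),
            new DocValues(Kind.NUMERIC, 1L, 2L, 3L),
            new DocValues(Kind.GEO_POINT, "41.12,-71.34"),
            new DocValues(Kind.BYTES, "a", "b"));
        System.out.println(countViaStrings(docs));
        System.out.println(countByKind(docs));
    }
}
```

Both methods return the same count (7 for the sample documents); the second one simply avoids creating and sorting strings for the numeric and geo cases.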

    hits    |   avg   |   sum   | value_count | value_count | value_count | value_count | value_count | value_count |
            |         |         |    double   |    double   |   keyword   |   keyword   |  geo_point  |  geo_point  |
            |         |         |   before    |    after    |   before    |    after    |   before    |    after    |
----------- | ------- | ------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
     20,000 |   .038s |   .033s |       .063s |       .026s |       .030s |       .030s |       .038s |       .015s |
    200,000 |   .127s |   .125s |       .334s |       .078s |       .116s |       .099s |       .278s |       .031s |
  2,000,000 |   .789s |   .729s |      3.176s |       .439s |       .348s |       .386s |      3.365s |       .178s |
 20,000,000 |  4.200s |  3.239s |     22.787s |      2.700s |      2.500s |      2.600s |     25.192s |      1.278s |
200,000,000 | 21.000s | 22.000s |    154.917s |     18.990s |     19.000s |     20.000s |    168.971s |      9.093s |

- The results are more in line with common sense. `value_count` now
  takes about the same time as `avg` and `sum`, or even less. Previously,
  `value_count` took much longer than `avg` and `sum`: about 7 times
  longer at 200,000,000 hits.
- When counting numeric types such as `double` and `long`, performance
  improves by about 8 to 9 times at 20,000,000 hits and above; when
  counting the `geo_point` type, it improves by 18 to 20 times.
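The factors quoted above can be checked against the last two rows of the table. The snippet below is just that arithmetic; the class name `SpeedupCheck` and the helper `speedup` are made up for illustration.

```java
class SpeedupCheck {
    /** Speedup factor: time before the change divided by time after it. */
    static double speedup(double beforeSeconds, double afterSeconds) {
        return beforeSeconds / afterSeconds;
    }

    public static void main(String[] args) {
        // Timings copied from the 20,000,000 and 200,000,000 hit rows of the table.
        System.out.printf("double,    20M hits:  %.1fx%n", speedup(22.787, 2.700));
        System.out.printf("double,    200M hits: %.1fx%n", speedup(154.917, 18.990));
        System.out.printf("geo_point, 20M hits:  %.1fx%n", speedup(25.192, 1.278));
        System.out.printf("geo_point, 200M hits: %.1fx%n", speedup(168.971, 9.093));
    }
}
```

This gives roughly 8.2x to 8.4x for `double` and 18.6x to 19.7x for `geo_point`. The smaller rows show smaller gains, for example about 2.4x for `double` at 20,000 hits.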

2020-04-10 13:16:39 -04:00

community-clients

[DOCS] Add Elixir Bulk Processor to community clients (#50630)

2020-01-15 16:08:56 -05:00

groovy-api

Make sure to use the type _doc in the REST documentation. (#34662)

2018-10-22 11:54:04 -07:00

java-api

[DOCS] Remove unneeded redirects (#50510)

2020-01-06 09:11:48 -06:00

java-rest

Broadcast cancellation to only nodes have outstanding child tasks (#54312)

2020-04-06 11:11:29 -04:00

painless

Docs: Use splitOnToken instead of custom function (#48408) (#54364)

2020-03-27 15:04:27 -06:00

perl

[DOCS] Various spelling corrections (#37046)

2019-01-07 14:44:12 +01:00

plugins

Add nori_number token filter in analysis-nori (#53583)

2020-03-23 19:53:34 +01:00

python

[DOCS] Update ES Python client doc links (#53092)

2020-03-09 10:34:38 -04:00

reference

value_count Aggregation optimization (backport of #54854) (#55076)

2020-04-10 13:16:39 -04:00

resiliency

[DOCS] Change http://elastic.co -> https (#48479) (#51812)

2020-02-03 09:50:11 -05:00

ruby

[DOCS] Various spelling corrections (#37046)

2019-01-07 14:44:12 +01:00

src/test

Wait for Active license before running CCR API tests (#53966)

2020-03-24 14:29:45 +01:00

build.gradle

Improve total build configuration time (#54611) (#54994)

2020-04-08 16:47:02 -07:00

README.asciidoc

[DOCS] Fix typo and ensure asterisks render properly (#52991) (#52992)

2020-03-02 11:56:42 +11:00

Versions.asciidoc

Fix broken link in client documentation (#51834)

2020-03-13 14:38:11 +01:00

README.asciidoc

The Elasticsearch docs are in AsciiDoc format and can be built using the
Elasticsearch documentation build process.

See: https://github.com/elastic/docs

=== Backporting doc fixes

* Doc changes should generally be made against master and backported through to the current version
(as applicable).

* Changes can also be backported to the maintenance version of the previous major version.
This is typically reserved for technical corrections, as it can require resolving more complex
merge conflicts, fixing test failures, and figuring out where to apply the change.

* Avoid backporting to out-of-maintenance versions.
Docs follow the same policy as code and fixes are not ordinarily merged to
versions that are out of maintenance.

* Do not backport doc changes to https://www.elastic.co/support/eol[EOL versions].

=== Snippet testing

Snippets marked with `[source,console]` are automatically annotated with
"VIEW IN CONSOLE" and "COPY AS CURL" in the documentation and are automatically
tested by the command `./gradlew -pdocs check`. To test just the docs from a
single page, use e.g. `./gradlew -pdocs integTestRunner --tests "\*rollover*"`.

By default each `[source,console]` snippet runs as its own isolated test. You
can manipulate the test execution in the following ways:

* `// TEST`: Explicitly marks a snippet as a test. Snippets marked this way
are tests even if they don't have `[source,console]` but usually `// TEST` is
used for its modifiers:
  * `// TEST[s/foo/bar/]`: Replace `foo` with `bar` in the generated test. This
  should be used sparingly because it makes the snippet "lie". Sometimes,
  though, you can use it to make the snippet more clear. Keep in mind that
  if there are multiple substitutions then they are applied in the order that
  they are defined.
  * `// TEST[catch:foo]`: Used to expect errors in the requests. Replace `foo`
  with `request` to expect a 400 error, for example. If the snippet contains
  multiple requests then only the last request will expect the error.
  * `// TEST[continued]`: Continue the test started in the last snippet. Between
  tests the nodes are cleaned: indexes are removed, etc. This prevents that
  from happening between snippets because the two snippets are a single test.
  This is most useful when you have text and snippets that work together to
  tell the story of some use case because it merges the snippets (and thus the
  use case) into one big test.
    * You can't use `// TEST[continued]` immediately after `// TESTSETUP` or
    `// TEARDOWN`.
  * `// TEST[skip:reason]`: Skip this test. Replace `reason` with the actual
  reason to skip the test. Snippets without `// TEST` or `// CONSOLE` aren't
  considered tests anyway but this is useful for explicitly documenting the
  reason why the test shouldn't be run.
  * `// TEST[setup:name]`: Run some setup code before running the snippet. This
  is useful for creating and populating indexes used in the snippet. The setup
  code is defined in `docs/build.gradle`. See `// TESTSETUP` below for a
  similar feature.
  * `// TEST[warning:some warning]`: Expect the response to include a `Warning`
  header. If the response doesn't include a `Warning` header with the exact
  text then the test fails. If the response includes `Warning` headers that
  aren't expected then the test fails.
* `[source,console-result]`: Matches this snippet against the body of the
response of the last test. If the response is JSON then order is ignored. If
you add `// TEST[continued]` to the snippet after `[source,console-result]`
it will continue in the same test, allowing you to interleave requests with
responses to check.
* `// TESTRESPONSE`: Explicitly marks a snippet as a test response even without
`[source,console-result]`. Similarly to `// TEST` this is mostly used for
its modifiers.
  * You can't use `[source,console-result]` immediately after `// TESTSETUP`.
  Instead, consider using `// TEST[continued]` or rearrange your snippets.

NOTE: Previously we only used `// TESTRESPONSE` instead of
`[source,console-result]` so you'll see that a lot in older branches but we
prefer `[source,console-result]` now.

  * `// TESTRESPONSE[s/foo/bar/]`: Substitutions. See `// TEST[s/foo/bar/]` for
  how it works. These are much more common than `// TEST[s/foo/bar/]` because
  they are useful for eliding portions of the response that are not pertinent
  to the documentation.
    * One interesting difference here is that you often want to match against
    the response from Elasticsearch. To do that you can reference the "body" of
    the response like this: `// TESTRESPONSE[s/"took": 25/"took": $body.took/]`.
    Note the `$body` string. This says "I don't expect that 25 number in the
    response, just match against what is in the response." Instead of writing
    the path into the response after `$body` you can write `$_path` which
    "figures out" the path. This is especially useful for making sweeping
    assertions like "I made up all the numbers in this example, don't compare
    them" which looks like `// TESTRESPONSE[s/\d+/$body.$_path/]`.
  * `// TESTRESPONSE[non_json]`: Add substitutions for testing responses in a
  format other than JSON. Use this after all other substitutions so it doesn't
  make other substitutions difficult.
  * `// TESTRESPONSE[skip:reason]`: Skip the assertions specified by this
  response.
* `// TESTSETUP`: Marks this snippet as the "setup" for all other snippets in
this file. This is a somewhat natural way of structuring documentation. You
say "this is the data we use to explain this feature" then you add the
snippet that you mark `// TESTSETUP` and then every snippet will turn into
a test that runs the setup snippet first. See the "painless" docs for a file
that puts this to good use. This is fairly similar to `// TEST[setup:name]`
but rather than the setup defined in `docs/build.gradle` the setup is defined
right in the documentation file. In general, we should prefer `// TESTSETUP`
over `// TEST[setup:name]` because it makes it more clear what steps have to
be taken before the examples will work. Tip: `// TESTSETUP` can only be used
on the first snippet of a document.
* `// TEARDOWN`: Ends and cleans up a test series started with `// TESTSETUP` or
`// TEST[setup:name]`. You can use `// TEARDOWN` to set up multiple tests in
the same file.
* `// NOTCONSOLE`: Marks this snippet as neither `// CONSOLE` nor
`// TESTRESPONSE`, excluding it from the list of unconverted snippets. We
should only use this for snippets that *are* JSON but are *not* responses or
requests.
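
The "applied in the order that they are defined" rule for `// TEST[s/foo/bar/]`
substitutions matters when one substitution produces text that a later one
matches. The sketch below illustrates that with ordinary Java regex
replacement. It is only an illustration: `SnippetSubstitutions` and `apply`
are hypothetical names, and it is an assumption that the docs build behaves
like a sequence of regex replacements.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SnippetSubstitutions {
    /** Applies each pattern -> replacement pair in insertion order, as regex replacements. */
    static String apply(String snippet, Map<String, String> substitutions) {
        String result = snippet;
        for (Map.Entry<String, String> s : substitutions.entrySet()) {
            result = result.replaceAll(s.getKey(), s.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // LinkedHashMap preserves the order in which the substitutions were defined.
        Map<String, String> subs = new LinkedHashMap<>();
        subs.put("foo", "bar"); // like // TEST[s/foo/bar/]
        subs.put("bar", "baz"); // like // TEST[s/bar/baz/], defined second
        System.out.println(apply("GET /foo/_search", subs)); // prints "GET /baz/_search"
    }
}
```

Defining the same two substitutions in the opposite order would leave
`GET /bar/_search`, which is why the order is worth keeping in mind.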

In addition to the standard CONSOLE syntax these snippets can contain blocks
of yaml surrounded by markers like this:

```
startyaml
- compare_analyzers: {index: thai_example, first: thai, second: rebuilt_thai}
endyaml
```

This allows slightly more expressive testing of the snippets. Since that syntax
is not supported by `[source,console]` the usual way to incorporate it is with a
`// TEST[s//]` marker like this:

```
// TEST[s/\n$/\nstartyaml\n - compare_analyzers: {index: thai_example, first: thai, second: rebuilt_thai}\nendyaml\n/]
```

Any place you can use json you can use elements like `$body.path.to.thing`
which is replaced on the fly with the contents of the thing at `path.to.thing`
in the last response.
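
As a rough mental model of what `$body.path.to.thing` does, the sketch below
walks a dotted path through a parsed response held as nested maps. It is an
illustration only: `BodyPath` and `resolve` are hypothetical names, and the
real test runner's path handling (arrays, escaping and so on) is not modelled.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class BodyPath {
    /** Walks a dotted path such as "hits.total.value" through nested maps. */
    static Object resolve(Map<String, Object> body, String path) {
        Object current = body;
        for (String key : path.split("\\.")) {
            if (!(current instanceof Map)) {
                throw new IllegalArgumentException("no object at '" + key + "' in " + path);
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }

    public static void main(String[] args) {
        // A made-up response body: {"took": 25, "hits": {"total": {"value": 3}}}
        Map<String, Object> total = Collections.singletonMap("value", 3);
        Map<String, Object> hits = Collections.singletonMap("total", total);
        Map<String, Object> body = new HashMap<>();
        body.put("took", 25);
        body.put("hits", hits);

        System.out.println(resolve(body, "took"));             // prints 25
        System.out.println(resolve(body, "hits.total.value")); // prints 3
    }
}
```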