value_count Aggregation optimization (backport of #54854) (#55076)
We found some performance problems during testing.
Data: 200 million docs, 1 shard, 0 replicas
hits | avg | sum | value_count |
----------- | ------- | ------- | ----------- |
20,000 | .038s | .033s | .063s |
200,000 | .127s | .125s | .334s |
2,000,000 | .789s | .729s | 3.176s |
20,000,000 | 4.200s | 3.239s | 22.787s |
200,000,000 | 21.000s | 22.000s | 154.917s |
The performance of `avg`, `sum`, and other statistical aggregations is very
close, but `value_count` has always been much worse, not even in the same
order of magnitude. Since `value_count` and `sum` are similar operations, we
expected them to take about the same time, so we investigated the
`value_count` aggregation.
Counting in Elasticsearch works by traversing the field of each document: an
ordinary value increments the count by 1, and an array of n values increments
it by n. The problem lies in reading out each document's field, which turns
bytes on disk into a Java object. We summarize the current problems as:
- the overhead of casting numbers to strings, and the GC pressure caused by
the large number of resulting strings
- after numbers are converted to strings, sorting and other unnecessary
operations are performed on them
Here is a demonstration of the type conversion overhead.
```java
// JDK source for converting a long to a String; getChars is very time-consuming.
public static String toString(long i) {
    int size = stringSize(i);
    if (COMPACT_STRINGS) {
        byte[] buf = new byte[size];
        getChars(i, size, buf);
        return new String(buf, LATIN1);
    } else {
        byte[] buf = new byte[size * 2];
        StringUTF16.getChars(i, size, buf);
        return new String(buf, UTF16);
    }
}
```
test type    | average | min  | max     | sum
------------ | ------- | ---- | ------- | -------
double->long | 32.2ns  | 28ns | 0.024ms | 3.22s
long->double | 31.9ns  | 28ns | 0.036ms | 3.19s
long->String | 163.8ns | 93ns | 1.921ms | 16.3s
The overhead of the long->String conversion is particularly serious.
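The overhead is easy to reproduce outside Elasticsearch. Below is a minimal, hypothetical micro-benchmark (not the harness used for the numbers above) that counts the same longs with and without the `Long.toString` detour; the class and method names are illustrative only:

```java
// Hypothetical micro-benchmark: counting longs directly versus counting them
// after a Long.toString conversion, which is roughly the extra work the old
// value_count path performed per value.
public class ConversionOverhead {

    // Direct counting: no per-value allocation at all.
    static long countDirect(long[] values) {
        long count = 0;
        for (long v : values) {
            count += 1;
        }
        return count;
    }

    // Counting via strings: every value allocates a byte[] and a String,
    // creating the GC pressure described above.
    static long countViaString(long[] values) {
        long count = 0;
        for (long v : values) {
            String s = Long.toString(v);
            if (!s.isEmpty()) {
                count += 1;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        long[] values = new long[2_000_000];
        for (int i = 0; i < values.length; i++) {
            values[i] = i * 31L;
        }

        long t0 = System.nanoTime();
        long direct = countDirect(values);
        long directNanos = System.nanoTime() - t0;

        t0 = System.nanoTime();
        long viaString = countViaString(values);
        long viaStringNanos = System.nanoTime() - t0;

        System.out.println("count=" + direct + " (direct " + directNanos / 1_000_000
                + "ms, via toString " + viaStringNanos / 1_000_000 + "ms)");
    }
}
```

On a typical JVM the `toString` loop should be noticeably slower and allocate millions of short-lived objects, consistent with the table above.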
Our optimization is actually very simple: handle each type separately instead
of uniformly converting everything to a string for processing. We added type
identification in ValueCountAggregator and special-cased the numeric and
geo_point types to skip the string conversion. Because far fewer strings and
string constants are created, the improvement is significant.
hits        | avg     | sum     | value_count (double, before) | value_count (double, after) | value_count (keyword, before) | value_count (keyword, after) | value_count (geo_point, before) | value_count (geo_point, after)
----------- | ------- | ------- | -------- | ------- | ------- | ------- | -------- | -------
20,000      | .038s   | .033s   | .063s    | .026s   | .030s   | .030s   | .038s    | .015s
200,000     | .127s   | .125s   | .334s    | .078s   | .116s   | .099s   | .278s    | .031s
2,000,000   | .789s   | .729s   | 3.176s   | .439s   | .348s   | .386s   | 3.365s   | .178s
20,000,000  | 4.200s  | 3.239s  | 22.787s  | 2.700s  | 2.500s  | 2.600s  | 25.192s  | 1.278s
200,000,000 | 21.000s | 22.000s | 154.917s | 18.990s | 19.000s | 20.000s | 168.971s | 9.093s
- The results are more in line with common sense: `value_count` now costs
about the same as `avg` and `sum`, or even less. Previously, `value_count`
was far slower than `avg` and `sum`, not even in the same order of magnitude
at large data volumes.
- For numeric types such as `double` and `long`, performance improves by
roughly 8 to 9 times; for the `geo_point` type, by 18 to 20 times.
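The idea behind the fix can be sketched as follows. These are hypothetical classes, not the real `ValueCountAggregator` or `ValuesSource` code: each type reports its per-document value count directly, so numbers and geo_points are counted without ever materializing a `String`.

```java
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical sketch of per-type value counting. Each field
// type exposes how many values the current document holds, so the counting
// loop never converts values to strings.
public class TypedValueCount {

    interface PerDocValues {
        int docValueCount(); // number of values in the current document
    }

    static final class LongValues implements PerDocValues {
        final long[] values;
        LongValues(long... values) { this.values = values; }
        public int docValueCount() { return values.length; }
    }

    static final class GeoPointValues implements PerDocValues {
        final double[] lats, lons;
        GeoPointValues(double[] lats, double[] lons) { this.lats = lats; this.lons = lons; }
        public int docValueCount() { return lats.length; }
    }

    // One counting loop for every type: +1 for a single value, +n for an array.
    static long valueCount(List<? extends PerDocValues> docs) {
        long count = 0;
        for (PerDocValues doc : docs) {
            count += doc.docValueCount();
        }
        return count;
    }

    public static void main(String[] args) {
        List<PerDocValues> docs = Arrays.asList(
                new LongValues(42),      // ordinary value: +1
                new LongValues(1, 2, 3), // array of 3 values: +3
                new GeoPointValues(new double[]{48.8}, new double[]{2.3})); // +1
        System.out.println(valueCount(docs)); // 5
    }
}
```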
[[release-notes-7.8.0]]
== {es} version 7.8.0

Also see <<breaking-changes-7.8,Breaking changes in 7.8>>.

[[breaking-7.8.0]]
[float]
=== Breaking changes

Aggregations::
* `value_count` aggregation optimization {pull}54854[#54854]

Features/Indices APIs::
* Add auto create action {pull}55858[#55858]

Mapping::
* Disallow changing 'enabled' on the root mapper {pull}54463[#54463] (issue: {issue}33933[#33933])
* Fix updating include_in_parent/include_in_root of nested field {pull}54386[#54386] (issue: {issue}53792[#53792])

[[deprecation-7.8.0]]
[float]
=== Deprecations

Authentication::
* Deprecate the `kibana` reserved user; introduce `kibana_system` user {pull}54967[#54967]

Cluster Coordination::
* Voting config exclusions should work with absent nodes {pull}50836[#50836] (issue: {issue}47990[#47990])

Features/Features::
* Add node local storage deprecation check {pull}54383[#54383] (issue: {issue}54374[#54374])

Features/Indices APIs::
* Deprecate local parameter for get field mapping request {pull}55014[#55014]

Infra/Core::
* Deprecate node local storage setting {pull}54374[#54374]

Infra/Plugins::
* Add xpack setting deprecations to deprecation API {pull}56290[#56290] (issue: {issue}54745[#54745])
* Deprecate disabling basic-license features {pull}54816[#54816] (issue: {issue}54745[#54745])
* Deprecated xpack "enable" settings should be no-ops {pull}55416[#55416] (issues: {issue}54745[#54745], {issue}54816[#54816])
* Make xpack.ilm.enabled setting a no-op {pull}55592[#55592] (issues: {issue}54745[#54745], {issue}54816[#54816], {issue}55416[#55416])
* Make xpack.monitoring.enabled setting a no-op {pull}55617[#55617] (issues: {issue}54745[#54745], {issue}54816[#54816], {issue}55416[#55416], {issue}55461[#55461], {issue}55592[#55592])
* Restore xpack.ilm.enabled and xpack.slm.enabled settings {pull}57383[#57383] (issues: {issue}54745[#54745], {issue}55416[#55416], {issue}55592[#55592])

[[feature-7.8.0]]
[float]
=== New features

Aggregations::
* Add Student's t-test aggregation support {pull}54469[#54469] (issue: {issue}53692[#53692])
* Add support for filters to t-test aggregation {pull}54980[#54980] (issue: {issue}53692[#53692])
* Histogram field type support for Sum aggregation {pull}55681[#55681] (issue: {issue}53285[#53285])
* Histogram field type support for ValueCount and Avg aggregations {pull}55933[#55933] (issue: {issue}53285[#53285])

Features/Indices APIs::
* Add simulate template composition API _index_template/_simulate_index/{name} {pull}55686[#55686] (issue: {issue}53101[#53101])

Geo::
* Add geo_bounds aggregation support for geo_shape {pull}55328[#55328]
* Add geo_shape support for geotile_grid and geohash_grid {pull}55966[#55966]
* Add geo_shape support for the geo_centroid aggregation {pull}55602[#55602]
* Add new point field {pull}53804[#53804]

SQL::
* Implement DATETIME_FORMAT function for date/time formatting {pull}54832[#54832] (issue: {issue}53714[#53714])
* Implement DATETIME_PARSE function for parsing strings {pull}54960[#54960] (issue: {issue}53714[#53714])
* Implement scripting inside aggs {pull}55241[#55241] (issues: {issue}29980[#29980], {issue}36865[#36865], {issue}37271[#37271])

[[enhancement-7.8.0]]
[float]
=== Enhancements

Aggregations::
* Aggs must specify a `field` or `script` (or both) {pull}52226[#52226]
* Expose aggregation usage in Feature Usage API {pull}55732[#55732] (issue: {issue}53746[#53746])
* Reduce memory for big aggregations run against many shards {pull}54758[#54758]
* Save memory on aggs in async search {pull}55683[#55683]

Allocation::
* Disk decider respect watermarks for single data node {pull}55805[#55805]
* Improve same-shard allocation explanations {pull}56010[#56010]

Analysis::
* Add preserve_original setting in ngram token filter {pull}55432[#55432]
* Add preserve_original setting in edge ngram token filter {pull}55766[#55766] (issue: {issue}55767[#55767])
* Add pre-configured “lowercase” normalizer {pull}53882[#53882] (issue: {issue}53872[#53872])

Audit::
* Update the audit logfile list of system users {pull}55578[#55578] (issue: {issue}37924[#37924])

Authentication::
* Let realms gracefully terminate the authN chain {pull}55623[#55623]

Authorization::
* Add reserved_ml_user and reserved_ml_admin kibana privileges {pull}54713[#54713]

Autoscaling::
* Rollover: refactor out cluster state update {pull}53965[#53965]

CRUD::
* Avoid holding onto bulk items until all completed {pull}54407[#54407]

Cluster Coordination::
* Add voting config exclusion add and clear API spec and integration test cases {pull}55760[#55760] (issue: {issue}48131[#48131])

Features/CAT APIs::
* Add support for V2 index templates to /_cat/templates {pull}55829[#55829] (issue: {issue}53101[#53101])

Features/Indices APIs::
* Add HLRC support for simulate index template api {pull}55936[#55936] (issue: {issue}53101[#53101])
* Add prefer_v2_templates flag and index setting {pull}55411[#55411] (issue: {issue}53101[#53101])
* Add warnings/errors when V2 templates would match same indices as V1 {pull}54367[#54367] (issue: {issue}53101[#53101])
* Disallow merging existing mapping field definitions in templates {pull}57701[#57701] (issues: {issue}55607[#55607], {issue}55982[#55982], {issue}57393[#57393])
* Emit deprecation warning if multiple v1 templates match with a new index {pull}55558[#55558] (issue: {issue}53101[#53101])
* Guard adding the index.prefer_v2_templates settings for pre-7.8 nodes {pull}55546[#55546] (issues: {issue}53101[#53101], {issue}55411[#55411], {issue}55539[#55539])
* Handle merging dotted object names when merging V2 template mappings {pull}55982[#55982] (issue: {issue}53101[#53101])
* Throw exception on duplicate mappings metadata fields when merging templates {pull}57835[#57835] (issue: {issue}57701[#57701])
* Update template v2 api rest spec {pull}55948[#55948] (issue: {issue}53101[#53101])
* Use V2 index templates during index creation {pull}54669[#54669] (issue: {issue}53101[#53101])
* Use V2 templates when reading duplicate aliases and ingest pipelines {pull}54902[#54902] (issue: {issue}53101[#53101])
* Validate V2 templates more strictly {pull}56170[#56170] (issues: {issue}43737[#43737], {issue}46045[#46045], {issue}53101[#53101], {issue}53970[#53970])

Features/Java High Level REST Client::
* Enable support for decompression of compressed response within RestHighLevelClient {pull}53533[#53533]

Features/Stats::
* Fix available / total disk cluster stats {pull}32480[#32480] (issue: {issue}32478[#32478])

Features/Watcher::
* Delay warning about missing x-pack {pull}54265[#54265] (issue: {issue}40898[#40898])

Geo::
* Add geo_shape mapper supporting doc-values in Spatial Plugin {pull}55037[#55037] (issue: {issue}53562[#53562])

Infra/Core::
* Decouple Environment from DiscoveryNode {pull}54373[#54373]
* Ensure that the output of node roles are sorted {pull}54376[#54376] (issue: {issue}54370[#54370])
* Reintroduce system index APIs for Kibana {pull}54858[#54858] (issues: {issue}52385[#52385], {issue}53912[#53912])
* Schedule commands in current thread context {pull}54187[#54187] (issue: {issue}17143[#17143])
* Start resource watcher service early {pull}54993[#54993] (issue: {issue}54867[#54867])

Infra/Packaging::
* Make Windows JAVA_HOME handling consistent with Linux {pull}55261[#55261] (issue: {issue}55134[#55134])

Infra/REST API::
* Add validation to the usage service {pull}54617[#54617]

Infra/Scripting::
* Scripting: stats per context in nodes stats {pull}54008[#54008] (issue: {issue}50152[#50152])

Machine Learning::
* Add effective max model memory limit to ML info {pull}55529[#55529] (issue: {issue}63942[#63942])
* Add loss_function to regression {pull}56118[#56118]
* Add new inference_config field to trained model config {pull}54421[#54421]
* Add failed_category_count to model_size_stats {pull}55716[#55716] (issue: {issue}1130[#1130])
* Add prediction_field_type to inference config {pull}55128[#55128]
* Allow a certain number of ill-formatted rows when delimited format is specified {pull}55735[#55735] (issue: {issue}38890[#38890])
* Apply default timeout in StopDataFrameAnalyticsAction.Request {pull}55512[#55512]
* Create an annotation when a model snapshot is stored {pull}53783[#53783] (issue: {issue}52149[#52149])
* Do not execute ML CRUD actions when upgrade mode is enabled {pull}54437[#54437] (issue: {issue}54326[#54326])
* Make find_file_structure recognize Kibana CSV report timestamps {pull}55609[#55609] (issue: {issue}55586[#55586])
* More advanced model snapshot retention options {pull}56125[#56125] (issue: {issue}52150[#52150])
* Return assigned node in start/open job/datafeed response {pull}55473[#55473] (issue: {issue}54067[#54067])
* Skip daily maintenance activity if upgrade mode is enabled {pull}54565[#54565] (issue: {issue}54326[#54326])
* Start gathering and storing inference stats {pull}53429[#53429]
* Unassign data frame analytics tasks in SetUpgradeModeAction {pull}54523[#54523] (issue: {issue}54326[#54326])
* Speed up anomaly detection for the lat_long function {ml-pull}1102[#1102]
* Reduce CPU scheduling priority of native analysis processes to favor the ES
JVM when CPU is constrained. This change is implemented only for Linux and macOS,
not for Windows {ml-pull}1109[#1109]
* Take `training_percent` into account when estimating memory usage for
classification and regression {ml-pull}1111[#1111]
* Support maximize minimum recall when assigning class labels for multiclass
classification {ml-pull}1113[#1113]
* Improve robustness of anomaly detection to bad input data {ml-pull}1114[#1114]
* Add new `num_matches` and `preferred_to_categories` fields to category output
{ml-pull}1062[#1062]
* Add mean squared logarithmic error (MSLE) for regression {ml-pull}1101[#1101]
* Add pseudo-Huber loss for regression {ml-pull}1168[#1168]
* Reduce peak memory usage and memory estimates for classification and regression
{ml-pull}1125[#1125]
* Reduce variability of classification and regression results across our target
operating systems {ml-pull}1127[#1127]
* Switch data frame analytics model memory estimates from kilobytes to
megabytes {ml-pull}1126[#1126] (issue: {issue}54506[#54506])
* Add a {ml} native code build for Linux on AArch64 {ml-pull}1132[#1132],
{ml-pull}1135[#1135]
* Improve data frame analytics runtime by optimising memory alignment for intrinsic
operations {ml-pull}1142[#1142]
* Fix spurious anomalies for count and sum functions after no data are received
for long periods of time {ml-pull}1158[#1158]
* Improve false positive rates from periodicity test for time series anomaly
detection {ml-pull}1177[#1177]
* Break progress reporting of data frame analyses into multiple phases {ml-pull}1179[#1179]
* Really centre the data before training for classification and regression begins. This
means we can choose more optimal smoothing bias and should reduce the number of trees
{ml-pull}1192[#1192]

Mapping::
* Merge V2 index/component template mappings in specific manner {pull}55607[#55607] (issue: {issue}53101[#53101])

Recovery::
* Avoid copying file chunks in peer recovery {pull}56072[#56072] (issue: {issue}55353[#55353])
* Retry failed peer recovery due to transient errors {pull}55353[#55353]

SQL::
* Add BigDecimal support to JDBC {pull}56015[#56015] (issue: {issue}43806[#43806])
* Drop BASE TABLE type in favour for just TABLE {pull}54836[#54836]
* Relax version lock between server and clients {pull}56148[#56148]

Search::
* Consolidate DelayableWriteable {pull}55932[#55932]
* Exists queries to MatchNoneQueryBuilder when the field is unmapped {pull}54857[#54857]
* Rewrite wrapper queries to match_none if possible {pull}55271[#55271]
* SearchService#canMatch takes into consideration the alias filter {pull}55120[#55120] (issue: {issue}55090[#55090])

Snapshot/Restore::
* Add GCS support for searchable snapshots {pull}55403[#55403]
* Allocate searchable snapshots with the balancer {pull}54889[#54889] (issues: {issue}50999[#50999], {issue}54729[#54729])
* Allow bulk snapshot deletes to abort {pull}56009[#56009] (issue: {issue}55773[#55773])
* Allow deleting multiple snapshots at once {pull}55474[#55474]
* Allow searching of snapshot taken while indexing {pull}55511[#55511] (issue: {issue}50999[#50999])
* Allow to prewarm the cache for searchable snapshot shards {pull}55322[#55322]
* Enable prewarming by default for searchable snapshots {pull}56201[#56201] (issue: {issue}55952[#55952])
* Permit searches to be concurrent to prewarming {pull}55795[#55795]
* Reduce contention in CacheFile.fileLock() method {pull}55662[#55662]
* Require soft deletes for searchable snapshots {pull}55453[#55453]
* Searchable Snapshots should respect max_restore_bytes_per_sec {pull}55952[#55952]
* Update the HDFS version used by HDFS Repo {pull}53693[#53693]
* Use streaming reads for GCS {pull}55506[#55506] (issue: {issue}55505[#55505])
* Use workers to warm cache parts {pull}55793[#55793] (issue: {issue}55322[#55322])

Task Management::
* Add indexName in update-settings task name {pull}55714[#55714]
* Add scroll info to search task description {pull}54606[#54606]
* Broadcast cancellation to only nodes have outstanding child tasks {pull}54312[#54312] (issues: {issue}50990[#50990], {issue}51157[#51157])
* Support hierarchical task cancellation {pull}54757[#54757] (issue: {issue}50990[#50990])

Transform::
* Add throttling {pull}56007[#56007] (issue: {issue}54862[#54862])

[[bug-7.8.0]]
[float]
=== Bug fixes

Aggregations::
* Add analytics plugin usage stats to _xpack/usage {pull}54911[#54911] (issue: {issue}54847[#54847])
* Aggregation support for Value Scripts that change types {pull}54830[#54830] (issue: {issue}54655[#54655])
* Allow terms agg to default to depth first {pull}54845[#54845]
* Clean up how pipeline aggs check for multi-bucket {pull}54161[#54161] (issue: {issue}53215[#53215])
* Fix auto_date_histogram serialization bug {pull}54447[#54447] (issues: {issue}54382[#54382], {issue}54429[#54429])
* Fix error message for unknown value type {pull}55821[#55821] (issue: {issue}55727[#55727])
* Fix scripted metric in CCS {pull}54776[#54776] (issue: {issue}54758[#54758])
* Use Decimal formatter for Numeric ValuesSourceTypes {pull}54366[#54366] (issue: {issue}54365[#54365])

Allocation::
* Fix Broken ExistingStoreRecoverySource Deserialization {pull}55657[#55657] (issue: {issue}55513[#55513])

Features/ILM+SLM::
* ILM stop step execution if writeIndex is false {pull}54805[#54805]

Features/Indices APIs::
* Fix NPE in MetadataIndexTemplateService#findV2Template {pull}54945[#54945]
* Fix creating filtered alias using now in a date_nanos range query failed {pull}54785[#54785] (issue: {issue}54315[#54315])
* Fix simulating index templates without specified index {pull}56295[#56295] (issues: {issue}53101[#53101], {issue}56255[#56255])
* Validate non-negative priorities for V2 index templates {pull}56139[#56139] (issue: {issue}53101[#53101])

Features/Watcher::
* Ensure watcher email action message ids are always unique {pull}56574[#56574]

Infra/Core::
* Add generic Set support to streams {pull}54769[#54769] (issue: {issue}54708[#54708])

Machine Learning::
* Fix GET _ml/inference so size param is respected {pull}57303[#57303] (issue: {issue}57298[#57298])
* Fix file structure finder multiline merge max for delimited formats {pull}56023[#56023]
* Validate at least one feature is available for DF analytics {pull}55876[#55876] (issue: {issue}55593[#55593])
* Trap and fail if insufficient features are supplied to data frame analyses.
Otherwise, classification and regression got stuck at zero analyzing progress
{ml-pull}1160[#1160] (issue: {issue}55593[#55593])
* Make categorization respect the model_memory_limit {ml-pull}1167[#1167]
(issue: {ml-issue}1130[#1130])
* Respect user overrides for max_trees for classification and regression
{ml-pull}1185[#1185]
* Reset memory status from soft_limit to ok when pruning is no longer required
{ml-pull}1193[#1193] (issue: {ml-issue}1131[#1131])
* Fix restore from training state for classification and regression
{ml-pull}1197[#1197]
* Improve the initialization of seasonal components for anomaly detection
{ml-pull}1201[#1201] (issue: {ml-issue}1178[#1178])

Network::
* Fix issue with pipeline releasing bytes early {pull}54458[#54458]
* Handle TLS file updates during startup {pull}54999[#54999] (issue: {issue}54867[#54867])

SQL::
* Fix DATETIME_PARSE behaviour regarding timezones {pull}56158[#56158] (issue: {issue}54960[#54960])

Search::
* Don't expand default_field in query_string before required {pull}55158[#55158] (issue: {issue}53789[#53789])
* Fix `time_zone` for `query_string` and date fields {pull}55881[#55881] (issue: {issue}55813[#55813])

Security::
* Fix certutil http for empty password with JDK 11 and lower {pull}55437[#55437] (issue: {issue}55386[#55386])

Transform::
* Fix count when matching exact ids {pull}56544[#56544] (issue: {issue}56196[#56196])
* Fix http status code when bad scripts are provided {pull}56117[#56117] (issue: {issue}55994[#55994])

[[regression-7.8.0]]
[float]
=== Regressions

Infra/Scripting::
* Don't double-wrap expression values {pull}54432[#54432] (issue: {issue}53661[#53661])