mirror of
https://github.com/honeymoose/OpenSearch.git
synced 2025-03-24 17:09:48 +00:00
* Reintroduce chunking to improve data extractor performance

Performing a sorted search/scroll over a time period that matches many documents is very expensive, because all documents are traversed for each page. The solution is to split the search time range into chunks and perform a separate search/scroll for each chunk.

This commit introduces a new `chunk` config in `datafeed_config` whose mode can be set to AUTO, OFF, or MANUAL, with the latter allowing an explicit chunk size to be specified. When set to AUTO, a heuristic determines the chunk size: it estimates the time interval within which we expect `scroll_size` documents and then takes 10 times that interval.

Based on benchmarking, this method gives a dramatic performance increase. For example, for the citizens dataset it improved the ingest rate from 0.33M docs / minute to 13.6M docs / minute. Farequote is now done in ~1 second.

Finally, note that when `chunk` is not specified, it defaults to AUTO when aggregations are not set and to OFF otherwise. This is because the chunk size heuristic is not well suited to aggregations, where one needs to chunk based on the cardinality of buckets rather than simply on time.

Relates to elastic/elasticsearch#734

Original commit: elastic/x-pack-elasticsearch@a738e86d21
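
The AUTO heuristic and the default-mode rule described in the commit message above can be illustrated with a minimal sketch. It is not the plugin's actual implementation: the function names, the millisecond units, and the assumption that documents are spread evenly over the data's time range are all hypothetical and chosen only for illustration.

```python
from typing import Iterator, Tuple

# The "10x multiple" mentioned in the commit message.
CHUNK_MULTIPLIER = 10


def default_chunk_mode(has_aggregations: bool) -> str:
    """Default mode when `chunk` is not specified (per the commit message)."""
    return "OFF" if has_aggregations else "AUTO"


def estimate_chunk_span_ms(earliest_ms: int, latest_ms: int,
                           total_docs: int, scroll_size: int) -> int:
    """Estimate a chunk span, assuming (hypothetically) evenly spread documents."""
    data_span_ms = max(latest_ms - earliest_ms, 1)
    if total_docs <= 0:
        # Nothing to size against: use one chunk covering the whole range.
        return data_span_ms
    # Time interval expected to contain roughly `scroll_size` documents.
    span_per_page_ms = data_span_ms * scroll_size / total_docs
    return max(int(span_per_page_ms * CHUNK_MULTIPLIER), 1)


def chunks(start_ms: int, end_ms: int,
           chunk_span_ms: int) -> Iterator[Tuple[int, int]]:
    """Yield consecutive [start, end) windows; each would get its own search/scroll."""
    current = start_ms
    while current < end_ms:
        chunk_end = min(current + chunk_span_ms, end_ms)
        yield current, chunk_end
        current = chunk_end
```

With these assumptions, 100,000 documents spread over 1,000,000 ms and a `scroll_size` of 1,000 give an estimated 10,000 ms per page, so a chunk span of 100,000 ms. Each window produced by `chunks` would then be searched and scrolled independently, which keeps each sorted scroll small.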
= Elasticsearch Ml Plugin

Behavioral Analytics for Elasticsearch