OpenSearch

Commit Graph


Author

SHA1

Message

Date

Zachary Tong

9cc33f4e29

[Rollup] Select best jobs then execute msearch-per-job (elastic/x-pack-elasticsearch#4152 )

If there are multiple jobs that are all the "best" (e.g. share the
best interval) we have no way of knowing which is actually the best.
Unfortunately, we cannot just filter for all the jobs in a single
search because their doc_counts can potentially overlap.

To solve this, we execute an msearch-per-job so that the results
stay isolated.  When rewriting the response, we iteratively
unroll and reduce the independent msearch responses into a single
"working tree".  This allows us to intervene if there are
overlapping buckets and manually choose a doc_count.
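
The reduction described above can be sketched roughly as follows. This is an illustration only, not the code from the commit: the flat `bucket key -> doc_count` shape, the `reduce_responses` name, and the `choose` parameter are all hypothetical, and the commit message does not say which count wins on overlap, so the default of `max` is an arbitrary placeholder.

```python
from typing import Callable, Dict, Iterable

# Hypothetical shape: each per-job response maps a bucket key to its doc_count.
JobResponse = Dict[str, int]


def reduce_responses(
    responses: Iterable[JobResponse],
    choose: Callable[[int, int], int] = max,  # assumption: placeholder policy
) -> Dict[str, int]:
    """Fold independent per-job responses into a single working tree."""
    working_tree: Dict[str, int] = {}
    for response in responses:
        for key, doc_count in response.items():
            if key in working_tree:
                # Overlapping bucket: pick one count instead of summing,
                # so the same documents are not counted twice.
                working_tree[key] = choose(working_tree[key], doc_count)
            else:
                working_tree[key] = doc_count
    return working_tree


# Two jobs that both rolled up 2018-03-02.
job_a = {"2018-03-01": 10, "2018-03-02": 7}
job_b = {"2018-03-02": 7, "2018-03-03": 4}
merged = reduce_responses([job_a, job_b])
```

Because each job is searched separately, the overlapping `2018-03-02` bucket is visible at merge time and keeps a count of 7 rather than being silently doubled to 14.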

Jobs are selected by recursively descending through the aggregation
tree and independently pruning the list of valid job caps in each branch.
When a leaf node is reached in the branch, the remaining jobs are
sorted by "best'ness" (see comparator in RollupJobIdentifierUtils for the
implementation) and added to a global set of "best jobs". Once
all branches have been evaluated, the final set is returned to the
calling code.
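
A minimal sketch of that descent, under simplifying assumptions: `JobCaps`, `AggNode`, and `collect_best_jobs` are hypothetical names, a job is "valid" for a node if it merely lists the node's field, and "best" is reduced to the largest date interval. The real ordering lives in the comparator in RollupJobIdentifierUtils.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class JobCaps:
    job_id: str
    supported_fields: FrozenSet[str]
    date_interval_secs: int


@dataclass
class AggNode:
    field_name: str
    children: List["AggNode"] = field(default_factory=list)


def collect_best_jobs(node: AggNode, candidates: List[JobCaps], best: Set[str]) -> None:
    # Prune: only jobs that can serve this node survive into the branch.
    remaining = [job for job in candidates if node.field_name in job.supported_fields]
    if not remaining:
        raise ValueError(f"no rollup job can serve field {node.field_name!r}")
    if not node.children:
        # Leaf: add every job tied for the top rank to the shared set.
        top = max(job.date_interval_secs for job in remaining)
        best.update(job.job_id for job in remaining if job.date_interval_secs == top)
        return
    for child in node.children:
        # Each branch prunes its own copy of the candidate list independently.
        collect_best_jobs(child, remaining, best)


tree = AggNode("timestamp", [AggNode("host"), AggNode("region")])
jobs = [
    JobCaps("a", frozenset({"timestamp", "host"}), 3600),
    JobCaps("b", frozenset({"timestamp", "region"}), 3600),
    JobCaps("c", frozenset({"timestamp", "host", "region"}), 60),
]
best_jobs: Set[str] = set()
collect_best_jobs(tree, jobs, best_jobs)
```

Here the `host` branch picks job `a` and the `region` branch picks job `b`, so two jobs are selected even though job `c` alone could serve both branches.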

Briefly, the "best" jobs are those that have
 - The largest compatible date interval
 - Fewer and larger interval histograms
 - Fewer terms groups
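
One way to express these three criteria is as a sort key. This is a sketch and not the comparator from RollupJobIdentifierUtils: `JobSummary` and `bestness_key` are hypothetical, the priority order of the criteria is assumed to follow the list above, and "larger interval histograms" is approximated by the sum of the histogram intervals. Compatibility of the date interval is assumed to have been checked during pruning.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class JobSummary:
    job_id: str
    date_interval_secs: int
    histogram_intervals: Tuple[int, ...]
    terms_groups: int


def bestness_key(job: JobSummary) -> tuple:
    """Smaller key means a better job."""
    return (
        -job.date_interval_secs,        # largest date interval first
        len(job.histogram_intervals),   # then fewer histograms
        -sum(job.histogram_intervals),  # then larger histogram intervals (assumed proxy)
        job.terms_groups,               # then fewer terms groups
    )


candidates = [
    JobSummary("hourly", 3600, (5,), 2),
    JobSummary("daily", 86400, (5, 10), 1),
    JobSummary("daily_lean", 86400, (10,), 1),
]
ranked_ids = [job.job_id for job in sorted(candidates, key=bestness_key)]
```

With these inputs, `daily_lean` ranks ahead of `daily` because it has fewer histograms at the same date interval, and both rank ahead of `hourly`.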

Note: the final set of "best" jobs is not guaranteed to be minimal;
there may be redundant effort due to independent branches choosing
jobs that are subsets of other branches.

Related changes:
- We have to include the job's ID in the rollup doc's
hash, so that different jobs don't overwrite the same summary
document.
- Now that we iteratively reduce the agg tree, the agg framework
injects empty buckets while we're working.  In most cases this
is harmless, but for `avg` aggs the empty bucket is a SumAgg while
any unrolled versions are converted into AvgAggs... causing a cast
exception.  To get around this, avg aggs are renamed to
`{source_name}.value` to prevent a conflict.
- The job filtering has been pushed up into a query filter, since it
applies to the entire msearch rather than just individual agg components.
- We no longer add a filter agg clause about the date_histo's interval, because 
that is handled by the job validation and pruning.
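
To illustrate the first related change, here is a sketch of a document ID derived from both the job ID and the bucket keys. The function name, the choice of SHA-256, and the separator bytes are assumptions made for the example; the commit message only states that the job's ID is included in the hash.

```python
import hashlib
from typing import Mapping


def rollup_doc_id(job_id: str, bucket_keys: Mapping[str, object]) -> str:
    """Hash the job ID together with the bucket keys (hypothetical scheme)."""
    hasher = hashlib.sha256()
    hasher.update(job_id.encode("utf-8"))
    # Sort the keys so the same bucket always hashes to the same ID.
    for name in sorted(bucket_keys):
        hasher.update(b"\x00")
        hasher.update(name.encode("utf-8"))
        hasher.update(b"\x00")
        hasher.update(str(bucket_keys[name]).encode("utf-8"))
    return hasher.hexdigest()


bucket = {"timestamp": 1522108800000, "host": "web-1"}
id_job_a = rollup_doc_id("job_a", bucket)
id_job_b = rollup_doc_id("job_b", bucket)
```

Two jobs that summarize the same bucket now produce different IDs, so neither overwrites the other's summary document, while re-running the same job for the same bucket still yields the same ID.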

Original commit: elastic/x-pack-elasticsearch@995be2a039

2018-03-27 10:33:59 -07:00

Zachary Tong

e3543b06ba

[Docs] Remove bad cross-book link

Temporary fix to keep the build green; will figure this out during
the next round of rollup docs work.

Original commit: elastic/x-pack-elasticsearch@7657938ffb

2018-02-23 23:23:51 +00:00

Zachary Tong

bf1550a0b2

Rollups for Elasticsearch (elastic/x-pack-elasticsearch#4002 )

This adds a new Rollup module to XPack, which allows users to configure periodic "rollup jobs" to pre-aggregate data.  That data is then available later for search through a special RollupSearch API, which mimics the DSL and functionality of regular search.

Rollups are used to drastically reduce the on-disk footprint of metric-based data (e.g. timestamped documents with numeric and keyword fields).  They can also be used to speed up aggregations over large datasets, since the rolled-up data will be considerably smaller, with fewer documents to search.

The PR adds seven new endpoints to interact with Rollups: create/get/delete job, start/stop job, a capabilities API similar to field-caps, and a Rollup-enabled search.

Original commit: elastic/x-pack-elasticsearch@dcde91aacf

2018-02-23 17:10:37 -05:00

3 Commits