74 lines
4.7 KiB
Plaintext
74 lines
4.7 KiB
Plaintext
[[rollup-overview]]
|
|
== Overview
|
|
|
|
Time-based data (documents that are predominantly identified by their timestamp) often have associated retention policies
|
|
to manage data growth. For example, your system may be generating 500,000 documents every second. That will generate
|
|
43 million documents per day, and nearly 16 billion documents a year.
|
|
|
|
While your analysts and data scientists may wish you stored that data indefinitely for analysis, time is never-ending and
|
|
so your storage requirements will continue to grow without bound. Retention policies are therefore often dictated
|
|
by the simple calculation of storage costs over time, and what the organization is willing to pay to retain historical data.
|
|
Often these policies start deleting data after a few months or years.
|
|
|
|
Storage cost is a fixed quantity. It takes X money to store Y data. But the utility of a piece of data often changes
|
|
with time. Sensor data gathered at millisecond granularity is extremely useful right now, reasonably useful if from a
|
|
few weeks ago, and only marginally useful if older than a few months.
|
|
|
|
So while the cost of storing a millisecond of sensor data from ten years ago is fixed, the value of that individual sensor
|
|
reading often diminishes with time. It's not useless -- it could easily contribute to a useful analysis -- but it's reduced
|
|
value often leads to deletion rather than paying the fixed storage cost.
|
|
|
|
=== Rollup store historical data at reduced granularity
|
|
|
|
That's where Rollup comes into play. The Rollup functionality summarizes old, high-granularity data into a reduced
|
|
granularity format for long-term storage. By "rolling" the data up into a single summary document, historical data
|
|
can be compressed greatly compared to the raw data.
|
|
|
|
For example, consider the system that's generating 43 million documents every day. The second-by-second data is useful
|
|
for real-time analysis, but historical analysis looking over ten years of data are likely to be working at a larger interval
|
|
such as hourly or daily trends.
|
|
|
|
If we compress the 43 million documents into hourly summaries, we can save vast amounts of space. The Rollup feature
|
|
automates this process of summarizing historical data.
|
|
|
|
Details about setting up and configuring Rollup are covered in <<rollup-put-job,Create Job API>>
|
|
|
|
=== Rollup uses standard query DSL
|
|
|
|
The Rollup feature exposes a new search endpoint (`/_rollup_search` vs the standard `/_search`) which knows how to search
|
|
over rolled-up data. Importantly, this endpoint accepts 100% normal {es} Query DSL. Your application does not need to learn
|
|
a new DSL to inspect historical data, it can simply reuse existing queries and dashboards.
|
|
|
|
There are some limitations to the functionality available; not all queries and aggregations are supported, certain search
|
|
features (highlighting, etc) are disabled, and available fields depend on how the rollup was configured. These limitations
|
|
are covered more in <<rollup-search-limitations, Rollup Search limitations>>.
|
|
|
|
But if your queries, aggregations and dashboards only use the available functionality, redirecting them to historical
|
|
data is trivial.
|
|
|
|
=== Rollup merges "live" and "rolled" data
|
|
|
|
A useful feature of Rollup is the ability to query both "live", realtime data in addition to historical "rolled" data
|
|
in a single query.
|
|
|
|
For example, your system may keep a month of raw data. After a month, it is rolled up into historical summaries using
|
|
Rollup and the raw data is deleted.
|
|
|
|
If you were to query the raw data, you'd only see the most recent month. And if you were to query the rolled up data, you
|
|
would only see data older than a month. The RollupSearch endpoint, however, supports querying both at the same time.
|
|
It will take the results from both data sources and merge them together. If there is overlap between the "live" and
|
|
"rolled" data, live data is preferred to increase accuracy.
|
|
|
|
=== Rollup is multi-interval aware
|
|
|
|
Finally, Rollup is capable of intelligently utilizing the best interval available. If you've worked with summarizing
|
|
features of other products, you'll find that they can be limiting. If you configure rollups at daily intervals... your
|
|
queries and charts can only work with daily intervals. If you need a monthly interval, you have to create another rollup
|
|
that explicitly stores monthly averages, etc.
|
|
|
|
The Rollup feature stores data in such a way that queries can identify the smallest available interval and use that
|
|
for their processing. If you store rollups at a daily interval, queries can be executed on daily or longer intervals
|
|
(weekly, monthly, etc) without the need to explicitly configure a new rollup job. This helps alleviate one of the major
|
|
disadvantages of a rollup system; reduced flexibility relative to raw data.
|
|
|