
[role="xpack"]
[testenv="basic"]
[[ml-transforms-usage]]
== When to use {transforms}

{es} aggregations are a powerful and flexible feature that enables you to
summarize and retrieve complex insights about your data. You can summarize
complex things like the number of web requests per day on a busy website, broken
down by geography and browser type. If you use the same data set to try to
calculate something as simple as a single number for the average duration of
visitor web sessions, however, you can quickly run out of memory.

Why does this occur? A web session duration is an example of a behavioral
attribute not held on any one log record; it has to be derived by finding the
first and last records for each session in the web logs. This derivation requires
some complex query expressions and a lot of memory to connect all the data
points. If you have an ongoing background process that fuses related events from
one index into entity-centric summaries in another index, you get a more useful,
joined-up picture. This new index is sometimes referred to as a _{dataframe}_.
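
As a rough illustration of what such a derivation involves, the following
in-memory Python sketch folds event-centric log records into one summary per
session. It is not how {transforms} are implemented; the `session_id` and
`timestamp` field names and the `summarize_sessions` helper are assumptions
made for this example only.

```python
from datetime import datetime


def summarize_sessions(events):
    """Fold event-centric log records into one duration per session.

    The field names "session_id" and "timestamp" are hypothetical.
    """
    first_seen = {}
    last_seen = {}
    for event in events:
        sid = event["session_id"]
        ts = datetime.fromisoformat(event["timestamp"])
        # Track the earliest and latest record seen for each session.
        if sid not in first_seen or ts < first_seen[sid]:
            first_seen[sid] = ts
        if sid not in last_seen or ts > last_seen[sid]:
            last_seen[sid] = ts
    # Duration in seconds: last record minus first record, per session.
    return {
        sid: (last_seen[sid] - first_seen[sid]).total_seconds()
        for sid in first_seen
    }


events = [
    {"session_id": "a", "timestamp": "2019-09-18T10:00:00"},
    {"session_id": "b", "timestamp": "2019-09-18T10:01:00"},
    {"session_id": "a", "timestamp": "2019-09-18T10:05:00"},
    {"session_id": "b", "timestamp": "2019-09-18T10:01:30"},
]
durations = summarize_sessions(events)
average = sum(durations.values()) / len(durations)
```

The average becomes trivial only after every session has been reduced to a
single row, and both dictionaries hold one entry per session while the events
are scanned, which hints at why the equivalent single query is memory hungry.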

You might want to consider using {transforms} instead of aggregations when:

* You need a complete _feature index_ rather than a top-N set of items.
+
In {ml}, you often need a complete set of behavioral features rather than just the
top-N. For example, if you are predicting customer churn, you might look at
features such as the number of website visits in the last week, the total number
of sales, or the number of emails sent. The {stack} {ml-features} create models
based on this multi-dimensional feature space, so they benefit from the full
feature indices that are created by {transforms}.
+
This scenario also applies when you are trying to search across the results of
an aggregation or multiple aggregations. Aggregation results can be ordered or
filtered, but there are
{ref}/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-order[limitations to ordering]
and
{ref}/search-aggregations-pipeline-bucket-selector-aggregation.html[filtering by bucket selector]
is constrained by the maximum number of buckets returned. If you want to search
all aggregation results, you need to create the complete {dataframe}. If you
need to sort or filter the aggregation results by multiple fields, {transforms}
are particularly useful.

* You need to sort aggregation results by a pipeline aggregation.
+
{ref}/search-aggregations-pipeline.html[Pipeline aggregations] cannot be used
for sorting. Technically, this is because pipeline aggregations are run during
the reduce phase after all other aggregations have already completed. If you
create a {transform}, you can effectively perform multiple passes over the data.

* You want to create summary tables to optimize queries.
+
For example, if you have a high-level dashboard that is accessed by a large
number of users and it uses a complex aggregation over a large dataset, it may
be more efficient to create a {transform} to cache the results. That way, each
user doesn't need to run the aggregation query.
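
The idea of making multiple passes over the data can be sketched in plain
Python. This is a minimal in-memory illustration, not the {transform}
mechanism itself; the `customer` and `amount` fields and both helper functions
are assumptions made for this example. The first pass builds a complete
per-customer summary; the second pass derives a value from that summary, then
filters and sorts on it.

```python
from collections import defaultdict


def build_summary(orders):
    """First pass: one row per customer with a total and an order count."""
    rows = defaultdict(lambda: {"total": 0.0, "orders": 0})
    for order in orders:
        row = rows[order["customer"]]
        row["total"] += order["amount"]
        row["orders"] += 1
    return dict(rows)


def rank_by_average(summary, min_orders=1):
    """Second pass: derive an average per customer, filter, then sort on it."""
    derived = [
        (customer, row["total"] / row["orders"])
        for customer, row in summary.items()
        if row["orders"] >= min_orders
    ]
    # Highest average first; ties are broken by customer name.
    return sorted(derived, key=lambda item: (-item[1], item[0]))


orders = [
    {"customer": "ann", "amount": 30.0},
    {"customer": "bob", "amount": 100.0},
    {"customer": "ann", "amount": 50.0},
    {"customer": "cy", "amount": 10.0},
    {"customer": "cy", "amount": 20.0},
]
summary = build_summary(orders)
ranking = rank_by_average(summary, min_orders=2)
```

Because the summary is complete rather than a top-N slice, the second pass can
filter on one field (`orders`) and sort on another (the derived average)
without losing any entities.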