diff --git a/docs/reference/transform/index.asciidoc b/docs/reference/transform/index.asciidoc
index 7242e850d3f..41ffd97ee39 100644
--- a/docs/reference/transform/index.asciidoc
+++ b/docs/reference/transform/index.asciidoc
@@ -1,73 +1,17 @@
 [role="xpack"]
 [[ml-dataframes]]
-= {transforms-cap}
+= Transforming data
 
 [partintro]
 --
 
-beta[]
-
-{es} aggregations are a powerful and flexible feature that enable you to
-summarize and retrieve complex insights about your data. You can summarize
-complex things like the number of web requests per day on a busy website, broken
-down by geography and browser type. If you use the same data set to try to
-calculate something as simple as a single number for the average duration of
-visitor web sessions, however, you can quickly run out of memory.
-
-Why does this occur? A web session duration is an example of a behavioral
-attribute not held on any one log record; it has to be derived by finding the
-first and last records for each session in our weblogs. This derivation requires
-some complex query expressions and a lot of memory to connect all the data
-points. If you have an ongoing background process that fuses related events from
-one index into entity-centric summaries in another index, you get a more useful,
-joined-up picture--this is essentially what _{dataframes}_ are.
-
-
-[discrete]
-[[ml-dataframes-usage]]
-== When to use {dataframes}
-
-You might want to consider using {dataframes} instead of aggregations when:
-
-* You need a complete _feature index_ rather than a top-N set of items.
-+
-In {ml}, you often need a complete set of behavioral features rather just the
-top-N. For example, if you are predicting customer churn, you might look at
-features such as the number of website visits in the last week, the total number
-of sales, or the number of emails sent. The {stack} {ml-features} create models
-based on this multi-dimensional feature space, so they benefit from full feature
-indices ({dataframes}).
-+
-This scenario also applies when you are trying to search across the results of
-an aggregation or multiple aggregations. Aggregation results can be ordered or
-filtered, but there are
-{ref}/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-order[limitations to ordering]
-and
-{ref}/search-aggregations-pipeline-bucket-selector-aggregation.html[filtering by bucket selector]
-is constrained by the maximum number of buckets returned. If you want to search
-all aggregation results, you need to create the complete {dataframe}. If you
-need to sort or filter the aggregation results by multiple fields, {dataframes}
-are particularly useful.
-
-* You need to sort aggregation results by a pipeline aggregation.
-+
-{ref}/search-aggregations-pipeline.html[Pipeline aggregations] cannot be used
-for sorting. Technically, this is because pipeline aggregations are run during
-the reduce phase after all other aggregations have already completed. If you
-create a {dataframe}, you can effectively perform multiple passes over the data.
-
-* You want to create summary tables to optimize queries.
-+
-For example, if you
-have a high level dashboard that is accessed by a large number of users and it
-uses a complex aggregation over a large dataset, it may be more efficient to
-create a {dataframe} to cache results. Thus, each user doesn't need to run the
-aggregation query.
-
-Though there are multiple ways to create {dataframes}, this content pertains
-to one specific method: _{transforms}_.
+{transforms-cap} enable you to convert existing {es} indices into summarized
+indices, which provide opportunities for new insights and analytics. For example,
+you can use {transforms} to pivot your data into entity-centric indices that
+summarize the behavior of users, sessions, or other entities in your data.
 
 * <>
+* <<ml-transforms-usage>>
 * <>
 * <>
 * <>
@@ -75,6 +19,7 @@ to one specific method: _{transforms}_.
 --
 
 include::overview.asciidoc[]
+include::usage.asciidoc[]
 include::checkpoints.asciidoc[]
 include::api-quickref.asciidoc[]
 include::dataframe-examples.asciidoc[]
diff --git a/docs/reference/transform/troubleshooting.asciidoc b/docs/reference/transform/troubleshooting.asciidoc
index 485512ff743..231abc288c1 100644
--- a/docs/reference/transform/troubleshooting.asciidoc
+++ b/docs/reference/transform/troubleshooting.asciidoc
@@ -1,3 +1,5 @@
+[role="xpack"]
+[testenv="basic"]
 [[dataframe-troubleshooting]]
 == Troubleshooting {transforms}
 [subs="attributes"]
diff --git a/docs/reference/transform/usage.asciidoc b/docs/reference/transform/usage.asciidoc
new file mode 100644
index 00000000000..70dfe0f80b3
--- /dev/null
+++ b/docs/reference/transform/usage.asciidoc
@@ -0,0 +1,56 @@
+[role="xpack"]
+[testenv="basic"]
+[[ml-transforms-usage]]
+== When to use {transforms}
+
+{es} aggregations are a powerful and flexible feature that enables you to
+summarize and retrieve complex insights about your data. You can summarize
+complex things like the number of web requests per day on a busy website, broken
+down by geography and browser type. If you use the same data set to try to
+calculate something as simple as a single number for the average duration of
+visitor web sessions, however, you can quickly run out of memory.
+
+Why does this occur? A web session duration is an example of a behavioral
+attribute not held on any one log record; it has to be derived by finding the
+first and last records for each session in your weblogs. This derivation requires
+some complex query expressions and a lot of memory to connect all the data
+points. If you have an ongoing background process that fuses related events from
+one index into entity-centric summaries in another index, you get a more useful,
+joined-up picture. This new index is sometimes referred to as a _{dataframe}_.
+A sketch of such a {transform} appears at the end of this page.
+
+You might want to consider using {transforms} instead of aggregations when:
+
+* You need a complete _feature index_ rather than a top-N set of items.
++
+In {ml}, you often need a complete set of behavioral features rather than just
+the top-N. For example, if you are predicting customer churn, you might look at
+features such as the number of website visits in the last week, the total number
+of sales, or the number of emails sent. The {stack} {ml-features} create models
+based on this multi-dimensional feature space, so they benefit from the full
+feature indices that are created by {transforms}.
++
+This scenario also applies when you are trying to search across the results of
+an aggregation or multiple aggregations. Aggregation results can be ordered or
+filtered, but there are
+{ref}/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-order[limitations to ordering],
+and
+{ref}/search-aggregations-pipeline-bucket-selector-aggregation.html[filtering by bucket selector]
+is constrained by the maximum number of buckets returned. If you want to search
+all aggregation results, you need to create the complete {dataframe}. If you
+need to sort or filter the aggregation results by multiple fields, {transforms}
+are particularly useful.
+
+* You need to sort aggregation results by a pipeline aggregation.
++
+{ref}/search-aggregations-pipeline.html[Pipeline aggregations] cannot be used
+for sorting. Technically, this is because pipeline aggregations are run during
+the reduce phase after all other aggregations have already completed. If you
+create a {transform}, you can effectively perform multiple passes over the data.
+
+* You want to create summary tables to optimize queries.
++
+For example, if you
+have a high-level dashboard that is accessed by a large number of users and it
+uses a complex aggregation over a large dataset, it may be more efficient to
+create a {transform} to cache results. Thus, each user doesn't need to run the
+aggregation query.
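+
+To make the session example above concrete, the following request is a minimal
+sketch of a pivot {transform} that groups weblog events by session and records
+the first and last timestamp of each session in an entity-centric destination
+index. The `weblogs` index and the `session_id` and `@timestamp` fields are
+hypothetical placeholders; adapt the names, `group_by` fields, and aggregations
+to your own data.
+
+[source,console]
+----
+PUT _transform/weblog-sessions
+{
+  "source": { "index": "weblogs" },
+  "dest": { "index": "weblog-sessions" },
+  "pivot": {
+    "group_by": {
+      "session_id": {
+        "terms": { "field": "session_id" } <1>
+      }
+    },
+    "aggregations": {
+      "session_start": { "min": { "field": "@timestamp" } }, <2>
+      "session_end": { "max": { "field": "@timestamp" } }
+    }
+  }
+}
+----
+<1> Each unique `session_id` becomes a single document in the destination index.
+<2> The earliest and latest event timestamps for the session; the session
+duration can be derived from these two values.
+
+After such a {transform} has run, a question like the average session duration
+becomes a simple aggregation or search against the summarized index rather than
+an expensive query over the raw weblog events.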