[DOCS] Updates transform landing page (#46689)
This commit is contained in: parent 507d879b70, commit a32fe1618c
@@ -1,73 +1,17 @@
[role="xpack"]
[[ml-dataframes]]
= {transforms-cap}
= Transforming data

[partintro]
--

beta[]

{es} aggregations are a powerful and flexible feature that enable you to
summarize and retrieve complex insights about your data. You can summarize
complex things like the number of web requests per day on a busy website, broken
down by geography and browser type. If you use the same data set to try to
calculate something as simple as a single number for the average duration of
visitor web sessions, however, you can quickly run out of memory.

Why does this occur? A web session duration is an example of a behavioral
attribute not held on any one log record; it has to be derived by finding the
first and last records for each session in our weblogs. This derivation requires
some complex query expressions and a lot of memory to connect all the data
points. If you have an ongoing background process that fuses related events from
one index into entity-centric summaries in another index, you get a more useful,
joined-up picture--this is essentially what _{dataframes}_ are.

[discrete]
[[ml-dataframes-usage]]
== When to use {dataframes}

You might want to consider using {dataframes} instead of aggregations when:

* You need a complete _feature index_ rather than a top-N set of items.
+
In {ml}, you often need a complete set of behavioral features rather than just the
top-N. For example, if you are predicting customer churn, you might look at
features such as the number of website visits in the last week, the total number
of sales, or the number of emails sent. The {stack} {ml-features} create models
based on this multi-dimensional feature space, so they benefit from full feature
indices ({dataframes}).
+
This scenario also applies when you are trying to search across the results of
an aggregation or multiple aggregations. Aggregation results can be ordered or
filtered, but there are
{ref}/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-order[limitations to ordering]
and
{ref}/search-aggregations-pipeline-bucket-selector-aggregation.html[filtering by bucket selector]
is constrained by the maximum number of buckets returned. If you want to search
all aggregation results, you need to create the complete {dataframe}. If you
need to sort or filter the aggregation results by multiple fields, {dataframes}
are particularly useful.

* You need to sort aggregation results by a pipeline aggregation.
+
{ref}/search-aggregations-pipeline.html[Pipeline aggregations] cannot be used
for sorting. Technically, this is because pipeline aggregations are run during
the reduce phase after all other aggregations have already completed. If you
create a {dataframe}, you can effectively perform multiple passes over the data.

* You want to create summary tables to optimize queries.
+
For example, if you
have a high-level dashboard that is accessed by a large number of users and it
uses a complex aggregation over a large dataset, it may be more efficient to
create a {dataframe} to cache results. Thus, each user doesn't need to run the
aggregation query.

Though there are multiple ways to create {dataframes}, this content pertains
to one specific method: _{transforms}_.
{transforms-cap} enable you to convert existing {es} indices into summarized
indices, which provide opportunities for new insights and analytics. For example,
you can use {transforms} to pivot your data into entity-centric indices that
summarize the behavior of users or sessions or other entities in your data.
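
A minimal sketch of such a pivot, assuming the Kibana `kibana_sample_data_ecommerce`
sample data set and the `_transform` endpoint used in recent releases (older 7.x
releases expose the same functionality under `_data_frame/transforms`); the
transform and index names here are illustrative only:

[source,console]
----
PUT _transform/ecommerce-customer-summary
{
  "source": { "index": "kibana_sample_data_ecommerce" },
  "dest": { "index": "ecommerce-customer-summary" }, <1>
  "pivot": {
    "group_by": {
      "customer_id": { "terms": { "field": "customer_id" } } <2>
    },
    "aggregations": {
      "order_count": { "value_count": { "field": "order_id" } },
      "total_spent": { "sum": { "field": "taxful_total_price" } }
    }
  }
}
----
<1> The destination index holds the entity-centric summary.
<2> Each unique `customer_id` becomes one summarized document in the destination index.
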

* <<ml-transform-overview>>
* <<ml-transforms-usage>>
* <<df-api-quickref>>
* <<dataframe-examples>>
* <<dataframe-troubleshooting>>
@@ -75,6 +19,7 @@ to one specific method: _{transforms}_.
--

include::overview.asciidoc[]
include::usage.asciidoc[]
include::checkpoints.asciidoc[]
include::api-quickref.asciidoc[]
include::dataframe-examples.asciidoc[]

@@ -1,3 +1,5 @@
[role="xpack"]
[testenv="basic"]
[[dataframe-troubleshooting]]
== Troubleshooting {transforms}
[subs="attributes"]

@@ -0,0 +1,56 @@
[role="xpack"]
[testenv="basic"]
[[ml-transforms-usage]]
== When to use {transforms}

{es} aggregations are a powerful and flexible feature that enable you to
summarize and retrieve complex insights about your data. You can summarize
complex things like the number of web requests per day on a busy website, broken
down by geography and browser type. If you use the same data set to try to
calculate something as simple as a single number for the average duration of
visitor web sessions, however, you can quickly run out of memory.

Why does this occur? A web session duration is an example of a behavioral
attribute not held on any one log record; it has to be derived by finding the
first and last records for each session in our weblogs. This derivation requires
some complex query expressions and a lot of memory to connect all the data
points. If you have an ongoing background process that fuses related events from
one index into entity-centric summaries in another index, you get a more useful,
joined-up picture. This new index is sometimes referred to as a _{dataframe}_.
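
As a rough illustration, a pivot {transform} could derive those per-session values.
This sketch assumes a hypothetical `weblogs` source index with `session_id` and
`@timestamp` fields and uses the `_transform` endpoint; the names are placeholders,
not part of any shipped data set:

[source,console]
----
PUT _transform/session-durations
{
  "source": { "index": "weblogs" },
  "dest": { "index": "weblog-sessions" },
  "pivot": {
    "group_by": {
      "session_id": { "terms": { "field": "session_id" } } <1>
    },
    "aggregations": {
      "first_seen": { "min": { "field": "@timestamp" } },
      "last_seen": { "max": { "field": "@timestamp" } },
      "duration_ms": { <2>
        "bucket_script": {
          "buckets_path": {
            "first": "first_seen",
            "last": "last_seen"
          },
          "script": "params.last - params.first"
        }
      }
    }
  }
}
----
<1> One summary document is written per unique `session_id`.
<2> The duration is derived from the first and last record of each session, which
is exactly the behavioral attribute that is hard to compute with a single
aggregation query.
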

You might want to consider using {transforms} instead of aggregations when:

* You need a complete _feature index_ rather than a top-N set of items.
+
In {ml}, you often need a complete set of behavioral features rather than just the
top-N. For example, if you are predicting customer churn, you might look at
features such as the number of website visits in the last week, the total number
of sales, or the number of emails sent. The {stack} {ml-features} create models
based on this multi-dimensional feature space, so they benefit from the full
feature indices that are created by {transforms}.
+
This scenario also applies when you are trying to search across the results of
an aggregation or multiple aggregations. Aggregation results can be ordered or
filtered, but there are
{ref}/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-order[limitations to ordering]
and
{ref}/search-aggregations-pipeline-bucket-selector-aggregation.html[filtering by bucket selector]
is constrained by the maximum number of buckets returned. If you want to search
all aggregation results, you need to create the complete {dataframe}. If you
need to sort or filter the aggregation results by multiple fields, {transforms}
are particularly useful, as shown in the sketch after this list.

* You need to sort aggregation results by a pipeline aggregation.
+
{ref}/search-aggregations-pipeline.html[Pipeline aggregations] cannot be used
for sorting. Technically, this is because pipeline aggregations are run during
the reduce phase after all other aggregations have already completed. If you
create a {transform}, you can effectively perform multiple passes over the data.

* You want to create summary tables to optimize queries.
+
For example, if you
have a high-level dashboard that is accessed by a large number of users and it
uses a complex aggregation over a large dataset, it may be more efficient to
create a {transform} to cache results. Thus, each user doesn't need to run the
aggregation query.
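
The sketch referenced in the first item above could look like the following search
against the destination index from the previous example. It assumes the hypothetical
`weblog-sessions` index created earlier; because the summaries are ordinary
documents, they can be filtered and sorted by several fields at once, including
values that a pipeline aggregation could otherwise only produce during the reduce
phase:

[source,console]
----
GET weblog-sessions/_search
{
  "query": {
    "range": { "duration_ms": { "gte": 600000 } } <1>
  },
  "sort": [
    { "duration_ms": "desc" },
    { "last_seen": "desc" } <2>
  ]
}
----
<1> Filter on the derived session duration (in milliseconds).
<2> Sort by multiple summary fields, something aggregation buckets do not support
directly.

A dashboard can run this inexpensive query against the summary index instead of
repeating the original aggregation for every user.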