[DOCS] Adds transform content (#46575) (#46578)

This commit is contained in:
Lisa Cawley 2019-09-11 08:44:03 -07:00 committed by GitHub
parent 461de5b58e
commit c0ec6ade4b
15 changed files with 1107 additions and 0 deletions

@@ -0,0 +1,21 @@
[role="xpack"]
[[df-api-quickref]]
== API quick reference
All {dataframe-transform} endpoints have the following base:
[source,js]
----
/_data_frame/transforms/
----
// NOTCONSOLE
* {ref}/put-data-frame-transform.html[Create {dataframe-transforms}]
* {ref}/delete-data-frame-transform.html[Delete {dataframe-transforms}]
* {ref}/get-data-frame-transform.html[Get {dataframe-transforms}]
* {ref}/get-data-frame-transform-stats.html[Get {dataframe-transforms} statistics]
* {ref}/preview-data-frame-transform.html[Preview {dataframe-transforms}]
* {ref}/start-data-frame-transform.html[Start {dataframe-transforms}]
* {ref}/stop-data-frame-transform.html[Stop {dataframe-transforms}]
For the full list, see {ref}/data-frame-apis.html[{dataframe-transform-cap} APIs].
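
For example, the following request lists the configured {dataframe-transforms};
it is shown here only as a minimal illustration of the base path above:

[source,js]
----
GET _data_frame/transforms
----
// NOTCONSOLE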

@@ -0,0 +1,88 @@
[role="xpack"]
[[ml-transform-checkpoints]]
== How {dataframe-transform} checkpoints work
++++
<titleabbrev>How checkpoints work</titleabbrev>
++++
beta[]
Each time a {dataframe-transform} examines the source indices and creates or
updates the destination index, it generates a _checkpoint_.
If your {dataframe-transform} runs only once, there is logically only one
checkpoint. If your {dataframe-transform} runs continuously, however, it creates
checkpoints as it ingests and transforms new source data.
To create a checkpoint, the {cdataframe-transform}:
. Checks for changes to source indices.
+
Using a simple periodic timer, the {dataframe-transform} checks for changes to
the source indices. This check occurs at the interval defined in the transform's
`frequency` property (see the configuration sketch after these steps).
+
If the source indices remain unchanged or if a checkpoint is already in
progress, the transform waits for the next timer.
. Identifies which entities have changed.
+
The {dataframe-transform} searches to see which entities have changed since the
last time it checked. The transform's `sync` configuration object identifies a
time field in the source indices. The transform uses the values in that field to
synchronize the source and destination indices.
. Updates the destination index (the {dataframe}) with the changed entities.
+
--
The {dataframe-transform} applies changes related to either new or changed
entities to the destination index. The set of changed entities is paginated. For
each page, the {dataframe-transform} performs a composite aggregation using a
`terms` query. After all the pages of changes have been applied, the checkpoint
is complete.
--
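
For reference, the `frequency` and `sync` properties referred to in these steps
appear in the transform configuration roughly as follows. This fragment is
illustrative only: the time field name is a placeholder and the values shown are
examples, not recommendations.

[source,js]
----
{
  "frequency": "5m", <1>
  "sync": {
    "time": {
      "field": "ingest_timestamp", <2>
      "delay": "60s"
    }
  }
}
----
// NOTCONSOLE
<1> How often the {dataframe-transform} checks the source indices for changes.
<2> The time field used to synchronize the source and destination indices.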
This checkpoint process involves both search and indexing activity on the
cluster. We have attempted to favor control over performance while developing
{dataframe-transforms}. We decided it was preferable for the
{dataframe-transform} to take longer to complete, rather than to finish quickly
and take precedence in resource consumption. That being said, the cluster still
requires enough resources to support both the composite aggregation search and
the indexing of its results.
TIP: If the cluster experiences unacceptable performance degradation due to the
{dataframe-transform}, stop the transform. Consider whether you can apply a
source query to the {dataframe-transform} to reduce the scope of data it
processes. Also consider whether the cluster has sufficient resources in place
to support both the composite aggregation search and the indexing of its
results.
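
For example, a source query along the following lines limits the transform to
recent documents only. The index and time field names here are hypothetical:

[source,js]
----
"source": {
  "index": "my-source-index",
  "query": {
    "range": {
      "timestamp": {
        "gte": "now-30d/d"
      }
    }
  }
}
----
// NOTCONSOLE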
[discrete]
[[ml-transform-checkpoint-errors]]
==== Error handling
Failures in {dataframe-transforms} tend to be related to searching or indexing.
To increase the resiliency of {dataframe-transforms}, the cursor positions of
the aggregated search and the changed entities search are tracked in memory and
persisted periodically.
Checkpoint failures can be categorized as follows:
* Temporary failures: The checkpoint is retried. If 10 consecutive failures
occur, the {dataframe-transform} has a failed status. For example, this
situation might occur when there are shard failures and queries return only
partial results.
* Irrecoverable failures: The {dataframe-transform} immediately fails. For
example, this situation occurs when the source index is not found.
* Adjustment failures: The {dataframe-transform} retries with adjusted settings.
For example, if parent circuit breaker memory errors occur during the
composite aggregation, the transform receives partial results. The aggregated
search is retried with a smaller number of buckets. This retry is performed at
the interval defined in the transform's `frequency` property. If the search
is retried to the point where it reaches a minimal number of buckets, an
irrecoverable failure occurs.
If the node running the {dataframe-transforms} fails, the transform restarts
from the most recent persisted cursor position. This recovery process might
repeat some of the work the transform had already done, but it ensures data
consistency.
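
If a {dataframe-transform} does reach a failed state, you can inspect its status
and failure information with the
{ref}/get-data-frame-transform-stats.html[get {dataframe-transform} statistics API].
For example, using a hypothetical transform ID:

[source,js]
----
GET _data_frame/transforms/my-transform/_stats
----
// NOTCONSOLE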

@@ -0,0 +1,335 @@
[role="xpack"]
[testenv="basic"]
[[dataframe-examples]]
== {dataframe-transform-cap} examples
++++
<titleabbrev>Examples</titleabbrev>
++++
beta[]
These examples demonstrate how to use {dataframe-transforms} to derive useful
insights from your data. All the examples use one of the
{kibana-ref}/add-sample-data.html[{kib} sample datasets]. For a more detailed,
step-by-step example, see
<<ecommerce-dataframes,Transforming your data with {dataframes}>>.
* <<ecommerce-dataframes>>
* <<example-best-customers>>
* <<example-airline>>
* <<example-clientips>>
include::ecommerce-example.asciidoc[]
[[example-best-customers]]
=== Finding your best customers
In this example, we use the eCommerce orders sample dataset to find the customers
who spent the most in our hypothetical webshop. Let's transform the data such
that the destination index contains the number of orders, the total price of the
orders, the average price per order, the average number of unique products per
order, and the total number of unique products for each customer.
[source,console]
----------------------------------
POST _data_frame/transforms/_preview
{
"source": {
"index": "kibana_sample_data_ecommerce"
},
"dest" : { <1>
"index" : "sample_ecommerce_orders_by_customer"
},
"pivot": {
"group_by": { <2>
"user": { "terms": { "field": "user" }},
"customer_id": { "terms": { "field": "customer_id" }}
},
"aggregations": {
"order_count": { "value_count": { "field": "order_id" }},
"total_order_amt": { "sum": { "field": "taxful_total_price" }},
"avg_amt_per_order": { "avg": { "field": "taxful_total_price" }},
"avg_unique_products_per_order": { "avg": { "field": "total_unique_products" }},
"total_unique_products": { "cardinality": { "field": "products.product_id" }}
}
}
}
----------------------------------
// TEST[skip:setup kibana sample data]
<1> This is the destination index for the {dataframe}. It is ignored by
`_preview`.
<2> Two `group_by` fields have been selected. This means the {dataframe} will
contain a unique row per `user` and `customer_id` combination. Within this
dataset, both these fields are unique. Including both in the {dataframe} gives
more context to the final results.
NOTE: In the example above, condensed JSON formatting has been used for easier
readability of the pivot object.
The preview {dataframe-transforms} API enables you to see the layout of the
{dataframe} in advance, populated with some sample values. For example:
[source,js]
----------------------------------
{
"preview" : [
{
"total_order_amt" : 3946.9765625,
"order_count" : 59.0,
"total_unique_products" : 116.0,
"avg_unique_products_per_order" : 2.0,
"customer_id" : "10",
"user" : "recip",
"avg_amt_per_order" : 66.89790783898304
},
...
]
}
----------------------------------
// NOTCONSOLE
This {dataframe} makes it easier to answer questions such as:
* Which customers spend the most?
* Which customers spend the most per order?
* Which customers order most often?
* Which customers ordered the least number of different products?
It's possible to answer these questions using aggregations alone; however,
{dataframes} allow us to persist this data as a customer-centric index. This
enables us to analyze data at scale and gives more flexibility to explore and
navigate data from a customer-centric perspective. In some cases, it can even
make creating visualizations much simpler.
[[example-airline]]
=== Finding air carriers with the most delays
In this example, we use the Flights sample dataset to find out which air carrier
had the most delays. First, we filter the source data such that it excludes all
the cancelled flights by using a query filter. Then we transform the data to
contain the distinct number of flights, the sum of delayed minutes, and the sum
of the flight minutes by air carrier. Finally, we use a
{ref}/search-aggregations-pipeline-bucket-script-aggregation.html[`bucket_script`]
to determine what percentage of the flight time was actually spent delayed.
[source,console]
----------------------------------
POST _data_frame/transforms/_preview
{
"source": {
"index": "kibana_sample_data_flights",
"query": { <1>
"bool": {
"filter": [
{ "term": { "Cancelled": false } }
]
}
}
},
"dest" : { <2>
"index" : "sample_flight_delays_by_carrier"
},
"pivot": {
"group_by": { <3>
"carrier": { "terms": { "field": "Carrier" }}
},
"aggregations": {
"flights_count": { "value_count": { "field": "FlightNum" }},
"delay_mins_total": { "sum": { "field": "FlightDelayMin" }},
"flight_mins_total": { "sum": { "field": "FlightTimeMin" }},
"delay_time_percentage": { <4>
"bucket_script": {
"buckets_path": {
"delay_time": "delay_mins_total.value",
"flight_time": "flight_mins_total.value"
},
"script": "(params.delay_time / params.flight_time) * 100"
}
}
}
}
}
----------------------------------
// TEST[skip:setup kibana sample data]
<1> Filter the source data to select only flights that were not cancelled.
<2> This is the destination index for the {dataframe}. It is ignored by
`_preview`.
<3> The data is grouped by the `Carrier` field which contains the airline name.
<4> This `bucket_script` performs calculations on the results that are returned
by the aggregation. In this particular example, it calculates what percentage of
travel time was taken up by delays.
The preview shows you that the new index would contain data like this for each
carrier:
[source,js]
----------------------------------
{
"preview" : [
{
"carrier" : "ES-Air",
"flights_count" : 2802.0,
"flight_mins_total" : 1436927.5130677223,
"delay_time_percentage" : 9.335543983955839,
"delay_mins_total" : 134145.0
},
...
]
}
----------------------------------
// NOTCONSOLE
This {dataframe} makes it easier to answer questions such as:
* Which air carrier has the most delays as a percentage of flight time?
NOTE: This data is fictional and does not reflect actual delays
or flight stats for any of the featured destination or origin airports.
[[example-clientips]]
=== Finding suspicious client IPs by using scripted metrics
With {dataframe-transforms}, you can use
{ref}/search-aggregations-metrics-scripted-metric-aggregation.html[scripted
metric aggregations] on your data. These aggregations are flexible and make
it possible to perform very complex processing. Let's use scripted metrics to
identify suspicious client IPs in the web log sample dataset.
We transform the data such that the new index contains the sum of bytes and the
number of distinct URLs, agents, incoming requests by location, and geographic
destinations for each client IP. We also use a scripted metric aggregation to
count the specific types of HTTP responses that each client IP receives.
Ultimately, the example below transforms web log data into an entity-centric
index where the entity is `clientip`.
[source,console]
----------------------------------
POST _data_frame/transforms/_preview
{
"source": {
"index": "kibana_sample_data_logs",
"query": { <1>
"range" : {
"timestamp" : {
"gte" : "now-30d/d"
}
}
}
},
"dest" : { <2>
"index" : "sample_weblogs_by_clientip"
},
"pivot": {
"group_by": { <3>
"clientip": { "terms": { "field": "clientip" } }
},
"aggregations": {
"url_dc": { "cardinality": { "field": "url.keyword" }},
"bytes_sum": { "sum": { "field": "bytes" }},
"geo.src_dc": { "cardinality": { "field": "geo.src" }},
"agent_dc": { "cardinality": { "field": "agent.keyword" }},
"geo.dest_dc": { "cardinality": { "field": "geo.dest" }},
"responses.total": { "value_count": { "field": "timestamp" }},
"responses.counts": { <4>
"scripted_metric": {
"init_script": "state.responses = ['error':0L,'success':0L,'other':0L]",
"map_script": """
def code = doc['response.keyword'].value;
if (code.startsWith('5') || code.startsWith('4')) {
state.responses.error += 1 ;
} else if(code.startsWith('2')) {
state.responses.success += 1;
} else {
state.responses.other += 1;
}
""",
"combine_script": "state.responses",
"reduce_script": """
def counts = ['error': 0L, 'success': 0L, 'other': 0L];
for (responses in states) {
counts.error += responses['error'];
counts.success += responses['success'];
counts.other += responses['other'];
}
return counts;
"""
}
},
"timestamp.min": { "min": { "field": "timestamp" }},
"timestamp.max": { "max": { "field": "timestamp" }},
"timestamp.duration_ms": { <5>
"bucket_script": {
"buckets_path": {
"min_time": "timestamp.min.value",
"max_time": "timestamp.max.value"
},
"script": "(params.max_time - params.min_time)"
}
}
}
}
}
----------------------------------
// TEST[skip:setup kibana sample data]
<1> This range query limits the transform to documents that are within the last
30 days at the point in time the {dataframe-transform} checkpoint is processed.
For batch {dataframes} this occurs once.
<2> This is the destination index for the {dataframe}. It is ignored by
`_preview`.
<3> The data is grouped by the `clientip` field.
<4> This `scripted_metric` performs a distributed operation on the web log data
to count specific types of HTTP responses (error, success, and other).
<5> This `bucket_script` calculates the duration of the `clientip` access based
on the results of the aggregation.
The preview shows you that the new index would contain data like this for each
client IP:
[source,js]
----------------------------------
{
"preview" : [
{
"geo" : {
"src_dc" : 12.0,
"dest_dc" : 9.0
},
"clientip" : "0.72.176.46",
"agent_dc" : 3.0,
"responses" : {
"total" : 14.0,
"counts" : {
"other" : 0,
"success" : 14,
"error" : 0
}
},
"bytes_sum" : 74808.0,
"timestamp" : {
"duration_ms" : 4.919943239E9,
"min" : "2019-06-17T07:51:57.333Z",
"max" : "2019-08-13T06:31:00.572Z"
},
"url_dc" : 11.0
},
...
}
----------------------------------
// NOTCONSOLE
This {dataframe} makes it easier to answer questions such as:
* Which client IPs are transferring the most data?
* Which client IPs are interacting with a high number of different URLs?
* Which client IPs have high error rates?
* Which client IPs are interacting with a high number of destination countries?

@@ -0,0 +1,262 @@
[role="xpack"]
[testenv="basic"]
[[ecommerce-dataframes]]
=== Transforming the eCommerce sample data
beta[]
<<ml-dataframes,{dataframe-transforms-cap}>> enable you to retrieve information
from an {es} index, transform it, and store it in another index. Let's use the
{kibana-ref}/add-sample-data.html[{kib} sample data] to demonstrate how you can
pivot and summarize your data with {dataframe-transforms}.
. If the {es} {security-features} are enabled, obtain a user ID with sufficient
privileges to complete these steps.
+
--
You need `manage_data_frame_transforms` cluster privileges to preview and create
{dataframe-transforms}. Members of the built-in `data_frame_transforms_admin`
role have these privileges.
You also need `read` and `view_index_metadata` index privileges on the source
index and `read`, `create_index`, and `index` privileges on the destination
index.
For more information, see <<security-privileges>> and <<built-in-roles>>.
--
. Choose your _source index_.
+
--
In this example, we'll use the eCommerce orders sample data. If you're not
already familiar with the `kibana_sample_data_ecommerce` index, use the
*Revenue* dashboard in {kib} to explore the data. Consider what insights you
might want to derive from this eCommerce data.
--
. Play with various options for grouping and aggregating the data.
+
--
For example, you might want to group the data by product ID and calculate the
total number of sales for each product and its average price. Alternatively, you
might want to look at the behavior of individual customers and calculate how
much each customer spent in total and how many different categories of products
they purchased. Or you might want to take the currencies or geographies into
consideration. What are the most interesting ways you can transform and
interpret this data?
_Pivoting_ your data involves using at least one field to group it and applying
at least one aggregation. You can preview what the transformed data will look
like, so go ahead and play with it!
For example, go to *Machine Learning* > *Data Frames* in {kib} and use the
wizard to create a {dataframe-transform}:
[role="screenshot"]
image::images/ecommerce-pivot1.jpg["Creating a simple {dataframe-transform} in {kib}"]
In this case, we grouped the data by customer ID and calculated the sum of
products each customer purchased.
Let's add some more aggregations to learn more about our customers' orders. For
example, let's calculate the total sum of their purchases, the maximum number of
products that they purchased in a single order, and their total number of orders.
We'll accomplish this by using the
{ref}/search-aggregations-metrics-sum-aggregation.html[`sum` aggregation] on the
`taxless_total_price` field, the
{ref}/search-aggregations-metrics-max-aggregation.html[`max` aggregation] on the
`total_quantity` field, and the
{ref}/search-aggregations-metrics-cardinality-aggregation.html[`cardinality` aggregation]
on the `order_id` field:
[role="screenshot"]
image::images/ecommerce-pivot2.jpg["Adding multiple aggregations to a {dataframe-transform} in {kib}"]
TIP: If you're interested in a subset of the data, you can optionally include a
{ref}/search-request-body.html#request-body-search-query[query] element. In this
example, we've filtered the data so that we're only looking at orders with a
`currency` of `EUR`. Alternatively, we could group the data by that field too.
If you want to use more complex queries, you can create your {dataframe} from a
{kibana-ref}/save-open-search.html[saved search].
If you prefer, you can use the
{ref}/preview-data-frame-transform.html[preview {dataframe-transforms} API]:
[source,js]
--------------------------------------------------
POST _data_frame/transforms/_preview
{
"source": {
"index": "kibana_sample_data_ecommerce",
"query": {
"bool": {
"filter": {
"term": {"currency": "EUR"}
}
}
}
},
"pivot": {
"group_by": {
"customer_id": {
"terms": {
"field": "customer_id"
}
}
},
"aggregations": {
"total_quantity.sum": {
"sum": {
"field": "total_quantity"
}
},
"taxless_total_price.sum": {
"sum": {
"field": "taxless_total_price"
}
},
"total_quantity.max": {
"max": {
"field": "total_quantity"
}
},
"order_id.cardinality": {
"cardinality": {
"field": "order_id"
}
}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[skip:set up sample data]
--
. When you are satisfied with what you see in the preview, create the
{dataframe-transform}.
+
--
.. Supply a job ID and the name of the target (or _destination_) index.
.. Decide whether you want the {dataframe-transform} to run once or continuously.
--
+
--
Since this sample data index is unchanging, let's use the default behavior and
just run the {dataframe-transform} once.
[role="screenshot"]
image::images/ecommerce-batch.jpg["Specifying the {dataframe-transform} options in {kib}"]
If you want to try it out, however, go ahead and click on *Continuous mode*.
You must choose a field that the {dataframe-transform} can use to check which
entities have changed. In general, it's a good idea to use the ingest timestamp
field. In this example, however, you can use the `order_date` field.
If you prefer, you can use the
{ref}/put-data-frame-transform.html[create {dataframe-transforms} API]. For
example:
[source,js]
--------------------------------------------------
PUT _data_frame/transforms/ecommerce-customer-transform
{
"source": {
"index": [
"kibana_sample_data_ecommerce"
],
"query": {
"bool": {
"filter": {
"term": {
"currency": "EUR"
}
}
}
}
},
"pivot": {
"group_by": {
"customer_id": {
"terms": {
"field": "customer_id"
}
}
},
"aggregations": {
"total_quantity.sum": {
"sum": {
"field": "total_quantity"
}
},
"taxless_total_price.sum": {
"sum": {
"field": "taxless_total_price"
}
},
"total_quantity.max": {
"max": {
"field": "total_quantity"
}
},
"order_id.cardinality": {
"cardinality": {
"field": "order_id"
}
}
}
},
"dest": {
"index": "ecommerce-customers"
}
}
--------------------------------------------------
// CONSOLE
// TEST[skip:setup kibana sample data]
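
If you choose continuous mode instead, the configuration additionally contains a
`sync` object that identifies the field used to check for changed entities, and
typically a `frequency`. A sketch of that addition, using the `order_date` field
mentioned above (the values are illustrative, not recommendations):

[source,js]
--------------------------------------------------
"frequency": "5m",
"sync": {
  "time": {
    "field": "order_date",
    "delay": "60s"
  }
}
--------------------------------------------------
// NOTCONSOLE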
--
. Start the {dataframe-transform}.
+
--
TIP: Even though resource utilization is automatically adjusted based on the
cluster load, a {dataframe-transform} increases search and indexing load on your
cluster while it runs. If you're experiencing an excessive load, however, you
can stop it.
You can start, stop, and manage {dataframe-transforms} in {kib}:
[role="screenshot"]
image::images/dataframe-transforms.jpg["Managing {dataframe-transforms} in {kib}"]
Alternatively, you can use the
{ref}/start-data-frame-transform.html[start {dataframe-transforms}] and
{ref}/stop-data-frame-transform.html[stop {dataframe-transforms}] APIs. For
example:
[source,js]
--------------------------------------------------
POST _data_frame/transforms/ecommerce-customer-transform/_start
--------------------------------------------------
// CONSOLE
// TEST[skip:setup kibana sample data]
--
. Explore the data in your new index.
+
--
For example, use the *Discover* application in {kib}:
[role="screenshot"]
image::images/ecommerce-results.jpg["Exploring the new index in {kib}"]
--
TIP: If you do not want to keep the {dataframe-transform}, you can delete it in
{kib} or use the
{ref}/delete-data-frame-transform.html[delete {dataframe-transform} API]. When
you delete a {dataframe-transform}, its destination index and {kib} index
patterns remain.
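
For example, assuming the transform created in this tutorial has been stopped,
the following request deletes it; the `ecommerce-customers` destination index and
any {kib} index pattern remain:

[source,js]
--------------------------------------------------
DELETE _data_frame/transforms/ecommerce-customer-transform
--------------------------------------------------
// NOTCONSOLE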

7 binary image files added (not shown)

@@ -0,0 +1,82 @@
[role="xpack"]
[[ml-dataframes]]
= {dataframe-transforms-cap}
[partintro]
--
beta[]
{es} aggregations are a powerful and flexible feature that enable you to
summarize and retrieve complex insights about your data. You can summarize
complex things like the number of web requests per day on a busy website, broken
down by geography and browser type. If you use the same data set to try to
calculate something as simple as a single number for the average duration of
visitor web sessions, however, you can quickly run out of memory.
Why does this occur? A web session duration is an example of a behavioral
attribute not held on any one log record; it has to be derived by finding the
first and last records for each session in our weblogs. This derivation requires
some complex query expressions and a lot of memory to connect all the data
points. If you have an ongoing background process that fuses related events from
one index into entity-centric summaries in another index, you get a more useful,
joined-up picture--this is essentially what _{dataframes}_ are.
[discrete]
[[ml-dataframes-usage]]
== When to use {dataframes}
You might want to consider using {dataframes} instead of aggregations when:
* You need a complete _feature index_ rather than a top-N set of items.
+
In {ml}, you often need a complete set of behavioral features rather than just the
top-N. For example, if you are predicting customer churn, you might look at
features such as the number of website visits in the last week, the total number
of sales, or the number of emails sent. The {stack} {ml-features} create models
based on this multi-dimensional feature space, so they benefit from full feature
indices ({dataframes}).
+
This scenario also applies when you are trying to search across the results of
an aggregation or multiple aggregations. Aggregation results can be ordered or
filtered, but there are
{ref}/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-order[limitations to ordering]
and
{ref}/search-aggregations-pipeline-bucket-selector-aggregation.html[filtering by bucket selector]
is constrained by the maximum number of buckets returned. If you want to search
all aggregation results, you need to create the complete {dataframe}. If you
need to sort or filter the aggregation results by multiple fields, {dataframes}
are particularly useful.
* You need to sort aggregation results by a pipeline aggregation.
+
{ref}/search-aggregations-pipeline.html[Pipeline aggregations] cannot be used
for sorting. Technically, this is because pipeline aggregations are run during
the reduce phase after all other aggregations have already completed. If you
create a {dataframe}, you can effectively perform multiple passes over the data.
* You want to create summary tables to optimize queries.
+
For example, if you have a high-level dashboard that is accessed by a large
number of users and it
uses a complex aggregation over a large dataset, it may be more efficient to
create a {dataframe} to cache results. Thus, each user doesn't need to run the
aggregation query.
Though there are multiple ways to create {dataframes}, this content pertains
to one specific method: _{dataframe-transforms}_.
* <<ml-transform-overview>>
* <<df-api-quickref>>
* <<dataframe-examples>>
* <<dataframe-troubleshooting>>
* <<dataframe-limitations>>
--
include::overview.asciidoc[]
include::checkpoints.asciidoc[]
include::api-quickref.asciidoc[]
include::dataframe-examples.asciidoc[]
include::troubleshooting.asciidoc[]
include::limitations.asciidoc[]

@@ -0,0 +1,219 @@
[role="xpack"]
[[dataframe-limitations]]
== {dataframe-transform-cap} limitations
[subs="attributes"]
++++
<titleabbrev>Limitations</titleabbrev>
++++
beta[]
The following limitations and known problems apply to the 7.4 release of
the Elastic {dataframe} feature:
[float]
[[df-compatibility-limitations]]
=== Beta {dataframe-transforms} do not have guaranteed backwards or forwards compatibility
While {dataframe-transforms} are in beta, it is not guaranteed that a
{dataframe-transform} created in a previous version of the {stack} will be able
to start and operate in a future version. Nor can support be provided for
{dataframe-transform} tasks operating in a cluster with mixed node versions.
Please note that the output of a {dataframe-transform} is persisted to a
destination index. This is a normal {es} index and is not affected by the beta
status.
[float]
[[df-ui-limitation]]
=== {dataframe-cap} UI will not work during a rolling upgrade from 7.2
If your cluster contains mixed version nodes, for example during a rolling
upgrade from 7.2 to a newer version, and {dataframe-transforms} have been
created in 7.2, the {dataframe} UI will not work. Please wait until all nodes
have been upgraded to the newer version before using the {dataframe} UI.
[float]
[[df-datatype-limitations]]
=== {dataframe-cap} data type limitation
{dataframes-cap} do not (yet) support fields containing arrays in the UI or
the API. If you try to create one, the UI will fail to show the source index
table.
[float]
[[df-ccs-limitations]]
=== {ccs-cap} is not supported
{ccs-cap} is not supported for {dataframe-transforms}.
[float]
[[df-kibana-limitations]]
=== Up to 1,000 {dataframe-transforms} are supported
A single cluster will support up to 1,000 {dataframe-transforms}.
When using the
{ref}/get-data-frame-transform.html[GET {dataframe-transforms} API], a total
`count` of transforms is returned. Use the `size` and `from` parameters to
enumerate through the full list.
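
For example, a request along these lines pages through the transforms 100 at a
time:

[source,js]
----
GET _data_frame/transforms?from=100&size=100
----
// NOTCONSOLE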
[float]
[[df-aggresponse-limitations]]
=== Aggregation responses may be incompatible with destination index mappings
When a {dataframe-transform} is first started, it will deduce the mappings
required for the destination index. This process is based on the field types of
the source index and the aggregations used. If the fields are derived from
{ref}/search-aggregations-metrics-scripted-metric-aggregation.html[`scripted_metrics`]
or {ref}/search-aggregations-pipeline-bucket-script-aggregation.html[`bucket_scripts`],
{ref}/dynamic-mapping.html[dynamic mappings] will be used. In some instances the
deduced mappings may be incompatible with the actual data. For example, numeric
overflows might occur or dynamically mapped fields might contain both numbers
and strings. Please check {es} logs if you think this may have occurred. As a
workaround, you may define custom mappings prior to starting the
{dataframe-transform}. For example,
{ref}/indices-create-index.html[create a custom destination index] or
{ref}/indices-templates.html[define an index template].
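
For example, a sketch of creating a destination index with explicit mappings
before starting the {dataframe-transform}; the index and field names are
illustrative:

[source,js]
----
PUT my-dest-index
{
  "mappings": {
    "properties": {
      "customer_id": { "type": "keyword" },
      "total_order_amt": { "type": "double" }
    }
  }
}
----
// NOTCONSOLE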
[float]
[[df-batch-limitations]]
=== Batch {dataframe-transforms} may not account for changed documents
A batch {dataframe-transform} uses a
{ref}/search-aggregations-bucket-composite-aggregation.html[composite aggregation]
which allows efficient pagination through all buckets. Composite aggregations
do not yet support a search context, therefore if the source data is changed
(deleted, updated, added) while the batch {dataframe} is in progress, then the
results may not include these changes.
[float]
[[df-consistency-limitations]]
=== {cdataframe-cap} consistency does not account for deleted or updated documents
While the process for {cdataframe-transforms} allows the continual recalculation
of the {dataframe-transform} as new data is being ingested, it also has some
limitations.
Changed entities are identified only if their time field has also been updated
and the update falls within the range of the check for changes. This is designed
in principle for, and suited to, the use case where new data is given a
timestamp at the time of ingest.
If the indices that fall within the scope of the source index pattern are
removed, for example when deleting historical time-based indices, then the
composite aggregation performed in consecutive checkpoint processing will search
over different source data, and entities that only existed in the deleted index
will not be removed from the {dataframe} destination index.
Depending on your use case, you may wish to recreate the {dataframe-transform}
entirely after deletions. Alternatively, if your use case is tolerant to
historical archiving, you may wish to include a max ingest timestamp in your
aggregation. This will allow you to exclude results that have not been recently
updated when viewing the {dataframe} destination index.
[float]
[[df-deletion-limitations]]
=== Deleting a {dataframe-transform} does not delete the {dataframe} destination index or {kib} index pattern
When you delete a {dataframe-transform} using `DELETE _data_frame/transforms/index`,
neither the {dataframe} destination index nor the {kib} index pattern, should
one have been created, is deleted. These objects must be deleted separately.
[float]
[[df-aggregation-page-limitations]]
=== Handling dynamic adjustment of aggregation page size
During the development of {dataframe-transforms}, control was favoured over
performance: it is preferable for the {dataframe-transform} to take longer to
complete quietly in the background rather than to finish quickly and take
precedence in resource consumption.
Composite aggregations are well suited to high-cardinality data, enabling
pagination through the results. If a {ref}/circuit-breaker.html[circuit breaker]
memory exception occurs while performing the composite aggregation search, the
search is retried with a reduced number of buckets. This circuit breaker is
calculated based upon all activity within the cluster, not just activity from
{dataframe-transforms}, so it may only be a temporary resource availability
issue.
For a batch {dataframe-transform}, the number of buckets requested is only ever
adjusted downwards. Lowering the value may result in a longer duration for the
transform checkpoint to complete. For {cdataframes}, the number of
buckets requested is reset back to its default at the start of every checkpoint
and it is possible for circuit breaker exceptions to occur repeatedly in the
{es} logs.
The {dataframe-transform} retrieves data in batches, which means it calculates
several buckets at once. By default, this is 500 buckets per search/index
operation. The default can be changed using `max_page_search_size`; the minimum
value is 10. If failures still occur once the number of buckets requested has
been reduced to its minimum, the {dataframe-transform} is set to a failed state.
[float]
[[df-dynamic-adjustments-limitations]]
=== Handling dynamic adjustments for many terms
For each checkpoint, entities are identified that have changed since the last
time the check was performed. This list of changed entities is supplied as a
{ref}/query-dsl-terms-query.html[terms query] to the {dataframe-transform}
composite aggregation, one page at a time. Then updates are applied to the
destination index for each page of entities.
The page `size` is defined by `max_page_search_size` which is also used to
define the number of buckets returned by the composite aggregation search. The
default value is 500, the minimum is 10.
The index setting
{ref}/index-modules.html#dynamic-index-settings[`index.max_terms_count`] defines
the maximum number of terms that can be used in a terms query. The default value
is 65536. If `max_page_search_size` exceeds `index.max_terms_count` the
transform will fail.
Using smaller values for `max_page_search_size` may result in a longer duration
for the transform checkpoint to complete.
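
As a sketch, assuming the 7.4 configuration layout where `max_page_search_size`
is part of the `pivot` object (the value shown is illustrative):

[source,js]
----
"pivot": {
  "max_page_search_size": 250,
  "group_by": { ... },
  "aggregations": { ... }
}
----
// NOTCONSOLE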
[float]
[[df-scheduling-limitations]]
=== {cdataframe-cap} scheduling limitations
A {cdataframe} periodically checks for changes to source data. The functionality
of the scheduler is currently limited to a basic periodic timer whose
`frequency` can range from 1s to 1h. The default is 1m. This is designed to run
little and often. When choosing a `frequency` for this timer, consider your
ingest rate along with the impact that the {dataframe-transform} search/index
operations have on other users in your cluster. Also note that retries occur at
the `frequency` interval.
[float]
[[df-failed-limitations]]
=== Handling of failed {dataframe-transforms}
Failed {dataframe-transforms} remain as persistent tasks and should be handled
appropriately, either by deleting them or by resolving the root cause of the
failure and restarting them.
When using the API to delete a failed {dataframe-transform}, first stop it using
`_stop?force=true`, then delete it.
To start a failed {dataframe-transform} after the root cause has been resolved,
you must specify the `_start?force=true` parameter.
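
For example, using a hypothetical transform ID, a failed transform can be
force-stopped and then deleted:

[source,js]
----
POST _data_frame/transforms/my-failed-transform/_stop?force=true
DELETE _data_frame/transforms/my-failed-transform
----
// NOTCONSOLE

Alternatively, after resolving the root cause, restart it with
`POST _data_frame/transforms/my-failed-transform/_start?force=true`.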
[float]
[[df-availability-limitations]]
=== {cdataframes-cap} may give incorrect results if documents are not yet available to search
After a document is indexed, there is a very small delay until it is available
to search.
A {cdataframe-transform} periodically checks for changed entities between the
time it last checked and `now` minus `sync.time.delay`. This time window moves
without overlapping. If the timestamp of a recently indexed document falls
within this time window but the document is not yet available to search, the
entity is not updated.
If you use a `sync.time.field` that represents the data ingest time and a
zero-second or very small `sync.time.delay`, it is more likely that this issue
will occur.

@@ -0,0 +1,71 @@
[role="xpack"]
[[ml-transform-overview]]
== {dataframe-transform-cap} overview
++++
<titleabbrev>Overview</titleabbrev>
++++
beta[]
A _{dataframe}_ is a two-dimensional tabular data structure. In the context of
the {stack}, it is a transformation of data that is indexed in {es}. For
example, you can use {dataframes} to _pivot_ your data into a new entity-centric
index. By transforming and summarizing your data, it becomes possible to
visualize and analyze it in alternative and interesting ways.
A lot of {es} indices are organized as a stream of events: each event is an
individual document, for example a single item purchase. {dataframes-cap} enable
you to summarize this data, bringing it into an organized, more
analysis-friendly format. For example, you can summarize all the purchases of a
single customer.
You can create {dataframes} by using {dataframe-transforms}.
{dataframe-transforms-cap} enable you to define a pivot, which is a set of
features that transform the index into a different, more digestible format.
Pivoting results in a summary of your data, which is the {dataframe}.
To define a pivot, first you select one or more fields that you will use to
group your data. You can select categorical fields (terms) and numerical fields
for grouping. If you use numerical fields, the field values are bucketed using
an interval that you specify.
The second step is deciding how you want to aggregate the grouped data. When
using aggregations, you effectively ask questions about the index. There are
different types of aggregations, each with its own purpose and output. To learn
more about the supported aggregations and group-by fields, see
{ref}/data-frame-transform-resource.html[{dataframe-transform-cap} resources].
As an optional step, you can also add a query to further limit the scope of the
aggregation.
The {dataframe-transform} performs a composite aggregation that
paginates through all the data defined by the source index query. The output of
the aggregation is stored in a destination index. Each time the
{dataframe-transform} queries the source index, it creates a _checkpoint_. You
can decide whether you want the {dataframe-transform} to run once (batch
{dataframe-transform}) or continuously ({cdataframe-transform}). A batch
{dataframe-transform} is a single operation that has a single checkpoint.
{cdataframe-transforms-cap} continually increment and process checkpoints as new
source data is ingested.
.Example
Imagine that you run a webshop that sells clothes. Every order creates a document
that contains a unique order ID, the name and the category of the ordered product,
its price, the ordered quantity, the exact date of the order, and some customer
information (name, gender, location, etc). Your dataset contains all the transactions
from last year.
If you want to check the sales in the different categories in your last fiscal
year, define a {dataframe-transform} that groups the data by the product
categories (women's shoes, men's clothing, etc.) and the order date. Use the
last year as the interval for the order date. Then add a sum aggregation on the
ordered quantity. The result is a {dataframe} that shows the number of sold
items in every product category in the last year.
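
A pivot along these lines could be previewed with a request such as the
following. The index and field names are hypothetical and depend on how the
order documents are mapped, and for simplicity the date range is applied as a
source query rather than as a second group-by:

[source,js]
----
POST _data_frame/transforms/_preview
{
  "source": {
    "index": "clothes-orders",
    "query": { "range": { "order_date": { "gte": "now-1y/d" } } }
  },
  "pivot": {
    "group_by": {
      "category": { "terms": { "field": "category" } }
    },
    "aggregations": {
      "quantity.sum": { "sum": { "field": "quantity" } }
    }
  }
}
----
// NOTCONSOLE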
[role="screenshot"]
image::images/ml-dataframepivot.jpg["Example of a data frame pivot in {kib}"]
IMPORTANT: The {dataframe-transform} leaves your source index intact. It
creates a new index that is dedicated to the {dataframe}.

@@ -0,0 +1,29 @@
[[dataframe-troubleshooting]]
== Troubleshooting {dataframe-transforms}
[subs="attributes"]
++++
<titleabbrev>Troubleshooting</titleabbrev>
++++
Use the information in this section to troubleshoot common problems.
include::{stack-repo-dir}/help.asciidoc[tag=get-help]
If you encounter problems with your {dataframe-transforms}, you can gather more
information from the following files and APIs:
* Lightweight audit messages are stored in `.data-frame-notifications-*`. Search
by your `transform_id`.
* The
{ref}/get-data-frame-transform-stats.html[get {dataframe-transform} statistics API]
provides information about the transform status and failures.
* If the {dataframe-transform} exists as a task, you can use the
{ref}/tasks.html[task management API] to gather task information. For example:
`GET _tasks?actions=data_frame/transforms*&detailed`. Typically, the task exists
when the transform is in a started or failed state.
* The {es} logs from the node that was running the {dataframe-transform} might
also contain useful information. You can identify the node from the notification
messages. Alternatively, if the task still exists, you can get that information
from the get {dataframe-transform} statistics API. For more information, see
{ref}/logging.html[Logging configuration].
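
For example, the following request searches the audit messages (described in the
first item above) for a hypothetical transform ID:

[source,js]
----
GET .data-frame-notifications-*/_search
{
  "query": {
    "term": { "transform_id": "my-transform" }
  }
}
----
// NOTCONSOLE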