OpenSearch/docs/reference/transform/overview.asciidoc

[role="xpack"]
[[ml-transform-overview]]
== {dataframe-transform-cap} overview
++++
<titleabbrev>Overview</titleabbrev>
++++

beta[]

A _{dataframe}_ is a two-dimensional tabular data structure. In the context of
the {stack}, it is a transformation of data that is indexed in {es}. For
example, you can use {dataframes} to _pivot_ your data into a new entity-centric
index. By transforming and summarizing your data, it becomes possible to
visualize and analyze it in alternative and interesting ways.

A lot of {es} indices are organized as a stream of events: each event is an 
individual document, for example a single item purchase. {dataframes-cap} enable
you to summarize this data, bringing it into an organized, more
analysis-friendly format. For example, you can summarize all the purchases of a
single customer.

You can create {dataframes} by using {dataframe-transforms}.
{dataframe-transforms-cap} enable you to define a pivot, which is a set of
features that transform the index into a different, more digestible format.
Pivoting results in a summary of your data, which is the {dataframe}.

To define a pivot, first you select one or more fields that you will use to
group your data. You can select categorical fields (terms) and numerical fields
for grouping. If you use numerical fields, the field values are bucketed using
an interval that you specify.

The second step is deciding how you want to aggregate the grouped data. When 
using aggregations, you practically ask questions about the index. There are 
different types of aggregations, each with its own purpose and output. To learn 
more about the supported aggregations and group-by fields, see 
{ref}/data-frame-transform-resource.html[{dataframe-transform-cap} resources].

As an optional step, you can also add a query to further limit the scope of the
aggregation.

The {dataframe-transform} performs a composite aggregation that 
paginates through all the data defined by the source index query. The output of
the aggregation is stored in a destination index. Each time the 
{dataframe-transform} queries the source index, it creates a _checkpoint_. You 
can decide whether you want the {dataframe-transform} to run once (batch 
{dataframe-transform}) or continuously ({cdataframe-transform}). A batch 
{dataframe-transform} is a single operation that has a single checkpoint. 
{cdataframe-transforms-cap} continually increment and process checkpoints as new 
source data is ingested.

.Example

Imagine that you run a webshop that sells clothes. Every order creates a document 
that contains a unique order ID, the name and the category of the ordered product, 
its price, the ordered quantity, the exact date of the order, and some customer 
information (name, gender, location, etc). Your dataset contains all the transactions 
from last year.

If you want to check the sales in the different categories in your last fiscal
year, define a {dataframe-transform} that groups the data by the product
categories (women's shoes, men's clothing, etc.) and the order date. Use the
last year as the interval for the order date. Then add a sum aggregation on the
ordered quantity. The result is a {dataframe} that shows the number of sold
items in every product category in the last year.

[role="screenshot"]
image::images/ml-dataframepivot.jpg["Example of a data frame pivot in {kib}"]

IMPORTANT: The {dataframe-transform} leaves your source index intact. It
creates a new index that is dedicated to the {dataframe}.
[DOCS] Adds transform content (#46575) (#46578) 2019-09-11 08:44:03 -07:00			`[role="xpack"]`
			`[[ml-transform-overview]]`
			`== {dataframe-transform-cap} overview`
			`++++`
			`<titleabbrev>Overview</titleabbrev>`
			`++++`

			`beta[]`

			`A _{dataframe}_ is a two-dimensional tabular data structure. In the context of`
			`the {stack}, it is a transformation of data that is indexed in {es}. For`
			`example, you can use {dataframes} to _pivot_ your data into a new entity-centric`
			`index. By transforming and summarizing your data, it becomes possible to`
			`visualize and analyze it in alternative and interesting ways.`

			`A lot of {es} indices are organized as a stream of events: each event is an`
			`individual document, for example a single item purchase. {dataframes-cap} enable`
			`you to summarize this data, bringing it into an organized, more`
			`analysis-friendly format. For example, you can summarize all the purchases of a`
			`single customer.`

			`You can create {dataframes} by using {dataframe-transforms}.`
			`{dataframe-transforms-cap} enable you to define a pivot, which is a set of`
			`features that transform the index into a different, more digestible format.`
			`Pivoting results in a summary of your data, which is the {dataframe}.`

			`To define a pivot, first you select one or more fields that you will use to`
			`group your data. You can select categorical fields (terms) and numerical fields`
			`for grouping. If you use numerical fields, the field values are bucketed using`
			`an interval that you specify.`

			`The second step is deciding how you want to aggregate the grouped data. When`
			`using aggregations, you practically ask questions about the index. There are`
			`different types of aggregations, each with its own purpose and output. To learn`
			`more about the supported aggregations and group-by fields, see`
			`{ref}/data-frame-transform-resource.html[{dataframe-transform-cap} resources].`

			`As an optional step, you can also add a query to further limit the scope of the`
			`aggregation.`

			`The {dataframe-transform} performs a composite aggregation that`
			`paginates through all the data defined by the source index query. The output of`
			`the aggregation is stored in a destination index. Each time the`
			`{dataframe-transform} queries the source index, it creates a _checkpoint_. You`
			`can decide whether you want the {dataframe-transform} to run once (batch`
			`{dataframe-transform}) or continuously ({cdataframe-transform}). A batch`
			`{dataframe-transform} is a single operation that has a single checkpoint.`
			`{cdataframe-transforms-cap} continually increment and process checkpoints as new`
			`source data is ingested.`

			`.Example`

			`Imagine that you run a webshop that sells clothes. Every order creates a document`
			`that contains a unique order ID, the name and the category of the ordered product,`
			`its price, the ordered quantity, the exact date of the order, and some customer`
			`information (name, gender, location, etc). Your dataset contains all the transactions`
			`from last year.`

			`If you want to check the sales in the different categories in your last fiscal`
			`year, define a {dataframe-transform} that groups the data by the product`
			`categories (women's shoes, men's clothing, etc.) and the order date. Use the`
			`last year as the interval for the order date. Then add a sum aggregation on the`
			`ordered quantity. The result is a {dataframe} that shows the number of sold`
			`items in every product category in the last year.`

			`[role="screenshot"]`
			`image::images/ml-dataframepivot.jpg["Example of a data frame pivot in {kib}"]`

			`IMPORTANT: The {dataframe-transform} leaves your source index intact. It`
			`creates a new index that is dedicated to the {dataframe}.`