[DOCS] Adds transform content (#46575) (#46578)

This commit is contained in:
Lisa Cawley 2019-09-11 08:44:03 -07:00 committed by GitHub
parent 461de5b58e
commit c0ec6ade4b
15 changed files with 1107 additions and 0 deletions

@@ -0,0 +1,21 @@
[role="xpack"]
[[df-api-quickref]]
== API quick reference
All {dataframe-transform} endpoints have the following base:
[source,js]
----
/_data_frame/transforms/
----
// NOTCONSOLE
* {ref}/put-data-frame-transform.html[Create {dataframe-transforms}]
* {ref}/delete-data-frame-transform.html[Delete {dataframe-transforms}]
* {ref}/get-data-frame-transform.html[Get {dataframe-transforms}]
* {ref}/get-data-frame-transform-stats.html[Get {dataframe-transforms} statistics]
* {ref}/preview-data-frame-transform.html[Preview {dataframe-transforms}]
* {ref}/start-data-frame-transform.html[Start {dataframe-transforms}]
* {ref}/stop-data-frame-transform.html[Stop {dataframe-transforms}]
For the full list, see {ref}/data-frame-apis.html[{dataframe-transform-cap} APIs].
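
For example, the following request lists the configured {dataframe-transforms};
it is shown here only as a minimal illustration of the base path above:

[source,js]
----
GET _data_frame/transforms
----
// NOTCONSOLE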

@@ -0,0 +1,88 @@
[role="xpack"]
[[ml-transform-checkpoints]]
== How {dataframe-transform} checkpoints work
++++
<titleabbrev>How checkpoints work</titleabbrev>
++++
beta[]
Each time a {dataframe-transform} examines the source indices and creates or
updates the destination index, it generates a _checkpoint_.
If your {dataframe-transform} runs only once, there is logically only one
checkpoint. If your {dataframe-transform} runs continuously, however, it creates
checkpoints as it ingests and transforms new source data.
To create a checkpoint, the {cdataframe-transform}:
. Checks for changes to source indices.
+
Using a simple periodic timer, the {dataframe-transform} checks for changes to
the source indices. This check occurs at the interval defined in the transform's
`frequency` property (see the configuration sketch after these steps).
+
If the source indices remain unchanged or if a checkpoint is already in
progress, the transform waits for the next timer.
. Identifies which entities have changed.
+
The {dataframe-transform} searches to see which entities have changed since the
last time it checked. The transform's `sync` configuration object identifies a
time field in the source indices. The transform uses the values in that field to
synchronize the source and destination indices.
. Updates the destination index (the {dataframe}) with the changed entities.
+
--
The {dataframe-transform} applies changes related to either new or changed
entities to the destination index. The set of changed entities is paginated. For
each page, the {dataframe-transform} performs a composite aggregation using a
`terms` query. After all the pages of changes have been applied, the checkpoint
is complete.
--
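
For reference, the `frequency` and `sync` properties referred to in these steps
appear in the transform configuration roughly as follows. This fragment is
illustrative only: the time field name is a placeholder and the values shown are
examples, not recommendations.

[source,js]
----
{
  "frequency": "5m", <1>
  "sync": {
    "time": {
      "field": "ingest_timestamp", <2>
      "delay": "60s"
    }
  }
}
----
// NOTCONSOLE
<1> How often the {dataframe-transform} checks the source indices for changes.
<2> The time field used to synchronize the source and destination indices.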
This checkpoint process involves both search and indexing activity on the
cluster. We have attempted to favor control over performance while developing
{dataframe-transforms}. We decided it was preferable for the
{dataframe-transform} to take longer to complete, rather than to finish quickly
and take precedence in resource consumption. That being said, the cluster still
requires enough resources to support both the composite aggregation search and
the indexing of its results.
TIP: If the cluster experiences unacceptable performance degradation due to the
{dataframe-transform}, stop the transform. Consider whether you can apply a
source query to the {dataframe-transform} to reduce the scope of data it
processes. Also consider whether the cluster has sufficient resources in place
to support both the composite aggregation search and the indexing of its
results.
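
For example, a source query along the following lines limits the transform to
recent documents only. The index and time field names here are hypothetical:

[source,js]
----
"source": {
  "index": "my-source-index",
  "query": {
    "range": {
      "timestamp": {
        "gte": "now-30d/d"
      }
    }
  }
}
----
// NOTCONSOLE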
[discrete]
[[ml-transform-checkpoint-errors]]
==== Error handling
Failures in {dataframe-transforms} tend to be related to searching or indexing.
To increase the resiliency of {dataframe-transforms}, the cursor positions of
the aggregated search and the changed entities search are tracked in memory and
persisted periodically.
Checkpoint failures can be categorized as follows:
* Temporary failures: The checkpoint is retried. If 10 consecutive failures
occur, the {dataframe-transform} has a failed status. For example, this
situation might occur when there are shard failures and queries return only
partial results.
* Irrecoverable failures: The {dataframe-transform} immediately fails. For
example, this situation occurs when the source index is not found.
* Adjustment failures: The {dataframe-transform} retries with adjusted settings.
For example, if parent circuit breaker memory errors occur during the
composite aggregation, the transform receives partial results. The aggregated
search is retried with a smaller number of buckets. This retry is performed at
the interval defined in the transform's `frequency` property. If the search
is retried to the point where it reaches a minimal number of buckets, an
irrecoverable failure occurs.
If the node running the {dataframe-transforms} fails, the transform restarts
from the most recent persisted cursor position. This recovery process might
repeat some of the work the transform had already done, but it ensures data
consistency.
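
If a {dataframe-transform} does reach a failed state, you can inspect its status
and failure information with the
{ref}/get-data-frame-transform-stats.html[get {dataframe-transform} statistics API].
For example, using a hypothetical transform ID:

[source,js]
----
GET _data_frame/transforms/my-transform/_stats
----
// NOTCONSOLE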

@@ -0,0 +1,335 @@
[role="xpack"]
[testenv="basic"]
[[dataframe-examples]]
== {dataframe-transform-cap} examples
++++
<titleabbrev>Examples</titleabbrev>
++++
beta[]
These examples demonstrate how to use {dataframe-transforms} to derive useful
insights from your data. All the examples use one of the
{kibana-ref}/add-sample-data.html[{kib} sample datasets]. For a more detailed,
step-by-step example, see
<<ecommerce-dataframes,Transforming your data with {dataframes}>>.
* <<ecommerce-dataframes>>
* <<example-best-customers>>
* <<example-airline>>
* <<example-clientips>>
include::ecommerce-example.asciidoc[]
[[example-best-customers]]
=== Finding your best customers
In this example, we use the eCommerce orders sample dataset to find the customers
who spent the most in our hypothetical webshop. Let's transform the data such
that the destination index contains the number of orders, the total price of the
orders, the average price per order, the average number of unique products per
order, and the total number of unique products for each customer.
[source,console]
----------------------------------
POST _data_frame/transforms/_preview
{
"source": {
"index": "kibana_sample_data_ecommerce"
},
"dest" : { <1>
"index" : "sample_ecommerce_orders_by_customer"
},
"pivot": {
"group_by": { <2>
"user": { "terms": { "field": "user" }},
"customer_id": { "terms": { "field": "customer_id" }}
},
"aggregations": {
"order_count": { "value_count": { "field": "order_id" }},
"total_order_amt": { "sum": { "field": "taxful_total_price" }},
"avg_amt_per_order": { "avg": { "field": "taxful_total_price" }},
"avg_unique_products_per_order": { "avg": { "field": "total_unique_products" }},
"total_unique_products": { "cardinality": { "field": "products.product_id" }}
}
}
}
----------------------------------
// TEST[skip:setup kibana sample data]
<1> This is the destination index for the {dataframe}. It is ignored by
`_preview`.
<2> Two `group_by` fields have been selected. This means the {dataframe} will
contain a unique row per `user` and `customer_id` combination. Within this
dataset, both these fields are unique. Including both in the {dataframe} gives
more context to the final results.
NOTE: In the example above, condensed JSON formatting has been used for easier
readability of the pivot object.
The preview {dataframe-transforms} API enables you to see the layout of the
{dataframe} in advance, populated with some sample values. For example:
[source,js]
----------------------------------
{
"preview" : [
{
"total_order_amt" : 3946.9765625,
"order_count" : 59.0,
"total_unique_products" : 116.0,
"avg_unique_products_per_order" : 2.0,
"customer_id" : "10",
"user" : "recip",
"avg_amt_per_order" : 66.89790783898304
},
...
]
}
----------------------------------
// NOTCONSOLE
This {dataframe} makes it easier to answer questions such as:
* Which customers spend the most?
* Which customers spend the most per order?
* Which customers order most often?
* Which customers ordered the least number of different products?
It's possible to answer these questions using aggregations alone; however,
{dataframes} allow us to persist this data as a customer-centric index. This
enables us to analyze data at scale and gives more flexibility to explore and
navigate data from a customer-centric perspective. In some cases, it can even
make creating visualizations much simpler.
[[example-airline]]
=== Finding air carriers with the most delays
In this example, we use the Flights sample dataset to find out which air carrier
had the most delays. First, we filter the source data such that it excludes all
the cancelled flights by using a query filter. Then we transform the data to
contain the distinct number of flights, the sum of delayed minutes, and the sum
of the flight minutes by air carrier. Finally, we use a
{ref}/search-aggregations-pipeline-bucket-script-aggregation.html[`bucket_script`]
to determine what percentage of the flight time was actually spent delayed.
[source,console]
----------------------------------
POST _data_frame/transforms/_preview
{
"source": {
"index": "kibana_sample_data_flights",
"query": { <1>
"bool": {
"filter": [
{ "term": { "Cancelled": false } }
]
}
}
},
"dest" : { <2>
"index" : "sample_flight_delays_by_carrier"
},
"pivot": {
"group_by": { <3>
"carrier": { "terms": { "field": "Carrier" }}
},
"aggregations": {
"flights_count": { "value_count": { "field": "FlightNum" }},
"delay_mins_total": { "sum": { "field": "FlightDelayMin" }},
"flight_mins_total": { "sum": { "field": "FlightTimeMin" }},
"delay_time_percentage": { <4>
"bucket_script": {
"buckets_path": {
"delay_time": "delay_mins_total.value",
"flight_time": "flight_mins_total.value"
},
"script": "(params.delay_time / params.flight_time) * 100"
}
}
}
}
}
----------------------------------
// TEST[skip:setup kibana sample data]
<1> Filter the source data to select only flights that were not cancelled.
<2> This is the destination index for the {dataframe}. It is ignored by
`_preview`.
<3> The data is grouped by the `Carrier` field which contains the airline name.
<4> This `bucket_script` performs calculations on the results that are returned
by the aggregation. In this particular example, it calculates what percentage of
travel time was taken up by delays.
The preview shows you that the new index would contain data like this for each
carrier:
[source,js]
----------------------------------
{
"preview" : [
{
"carrier" : "ES-Air",
"flights_count" : 2802.0,
"flight_mins_total" : 1436927.5130677223,
"delay_time_percentage" : 9.335543983955839,
"delay_mins_total" : 134145.0
},
...
]
}
----------------------------------
// NOTCONSOLE
This {dataframe} makes it easier to answer questions such as:
* Which air carrier has the most delays as a percentage of flight time?
NOTE: This data is fictional and does not reflect actual delays
or flight stats for any of the featured destination or origin airports.
[[example-clientips]]
=== Finding suspicious client IPs by using scripted metrics
With {dataframe-transforms}, you can use
{ref}/search-aggregations-metrics-scripted-metric-aggregation.html[scripted
metric aggregations] on your data. These aggregations are flexible and make
it possible to perform very complex processing. Let's use scripted metrics to
identify suspicious client IPs in the web log sample dataset.
We transform the data such that the new index contains the sum of bytes and the
number of distinct URLs, agents, incoming requests by location, and geographic
destinations for each client IP. We also use a scripted metric aggregation to
count the specific types of HTTP responses that each client IP receives.
Ultimately, the example below transforms web log data into an entity-centric
index where the entity is `clientip`.
[source,console]
----------------------------------
POST _data_frame/transforms/_preview
{
"source": {
"index": "kibana_sample_data_logs",
"query": { <1>
"range" : {
"timestamp" : {
"gte" : "now-30d/d"
}
}
}
},
"dest" : { <2>
"index" : "sample_weblogs_by_clientip"
},
"pivot": {
"group_by": { <3>
"clientip": { "terms": { "field": "clientip" } }
},
"aggregations": {
"url_dc": { "cardinality": { "field": "url.keyword" }},
"bytes_sum": { "sum": { "field": "bytes" }},
"geo.src_dc": { "cardinality": { "field": "geo.src" }},
"agent_dc": { "cardinality": { "field": "agent.keyword" }},
"geo.dest_dc": { "cardinality": { "field": "geo.dest" }},
"responses.total": { "value_count": { "field": "timestamp" }},
"responses.counts": { <4>
"scripted_metric": {
"init_script": "state.responses = ['error':0L,'success':0L,'other':0L]",
"map_script": """
def code = doc['response.keyword'].value;
if (code.startsWith('5') || code.startsWith('4')) {
state.responses.error += 1 ;
} else if(code.startsWith('2')) {
state.responses.success += 1;
} else {
state.responses.other += 1;
}
""",
"combine_script": "state.responses",
"reduce_script": """
def counts = ['error': 0L, 'success': 0L, 'other': 0L];
for (responses in states) {
counts.error += responses['error'];
counts.success += responses['success'];
counts.other += responses['other'];
}
return counts;
"""
}
},
"timestamp.min": { "min": { "field": "timestamp" }},
"timestamp.max": { "max": { "field": "timestamp" }},
"timestamp.duration_ms": { <5>
"bucket_script": {
"buckets_path": {
"min_time": "timestamp.min.value",
"max_time": "timestamp.max.value"
},
"script": "(params.max_time - params.min_time)"
}
}
}
}
}
----------------------------------
// TEST[skip:setup kibana sample data]
<1> This range query limits the transform to documents that are within the last
30 days at the point in time the {dataframe-transform} checkpoint is processed.
For batch {dataframes} this occurs once.
<2> This is the destination index for the {dataframe}. It is ignored by
`_preview`.
<3> The data is grouped by the `clientip` field.
<4> This `scripted_metric` performs a distributed operation on the web log data
to count specific types of HTTP responses (error, success, and other).
<5> This `bucket_script` calculates the duration of the `clientip` access based
on the results of the aggregation.
The preview shows you that the new index would contain data like this for each
client IP:
[source,js]
----------------------------------
{
"preview" : [
{
"geo" : {
"src_dc" : 12.0,
"dest_dc" : 9.0
},
"clientip" : "0.72.176.46",
"agent_dc" : 3.0,
"responses" : {
"total" : 14.0,
"counts" : {
"other" : 0,
"success" : 14,
"error" : 0
}
},
"bytes_sum" : 74808.0,
"timestamp" : {
"duration_ms" : 4.919943239E9,
"min" : "2019-06-17T07:51:57.333Z",
"max" : "2019-08-13T06:31:00.572Z"
},
"url_dc" : 11.0
},
...
}
----------------------------------
// NOTCONSOLE
This {dataframe} makes it easier to answer questions such as:
* Which client IPs are transferring the most data?
* Which client IPs are interacting with a high number of different URLs?
* Which client IPs have high error rates?
* Which client IPs are interacting with a high number of destination countries?

@@ -0,0 +1,262 @@
[role="xpack"]
[testenv="basic"]
[[ecommerce-dataframes]]
=== Transforming the eCommerce sample data
beta[]
<<ml-dataframes,{dataframe-transforms-cap}>> enable you to retrieve information
from an {es} index, transform it, and store it in another index. Let's use the
{kibana-ref}/add-sample-data.html[{kib} sample data] to demonstrate how you can
pivot and summarize your data with {dataframe-transforms}.
. If the {es} {security-features} are enabled, obtain a user ID with sufficient
privileges to complete these steps.
+
--
You need `manage_data_frame_transforms` cluster privileges to preview and create
{dataframe-transforms}. Members of the built-in `data_frame_transforms_admin`
role have these privileges.
You also need `read` and `view_index_metadata` index privileges on the source
index and `read`, `create_index`, and `index` privileges on the destination
index.
For more information, see <<security-privileges>> and <<built-in-roles>>.
--
. Choose your _source index_.
+
--
In this example, we'll use the eCommerce orders sample data. If you're not
already familiar with the `kibana_sample_data_ecommerce` index, use the
*Revenue* dashboard in {kib} to explore the data. Consider what insights you
might want to derive from this eCommerce data.
--
. Play with various options for grouping and aggregating the data.
+
--
For example, you might want to group the data by product ID and calculate the
total number of sales for each product and its average price. Alternatively, you
might want to look at the behavior of individual customers and calculate how
much each customer spent in total and how many different categories of products
they purchased. Or you might want to take the currencies or geographies into
consideration. What are the most interesting ways you can transform and
interpret this data?
_Pivoting_ your data involves using at least one field to group it and applying
at least one aggregation. You can preview what the transformed data will look
like, so go ahead and play with it!
For example, go to *Machine Learning* > *Data Frames* in {kib} and use the
wizard to create a {dataframe-transform}:
[role="screenshot"]
image::images/ecommerce-pivot1.jpg["Creating a simple {dataframe-transform} in {kib}"]
In this case, we grouped the data by customer ID and calculated the sum of
products each customer purchased.
Let's add some more aggregations to learn more about our customers' orders. For
example, let's calculate the total sum of their purchases, the maximum number of
products that they purchased in a single order, and their total number of orders.
We'll accomplish this by using the
{ref}/search-aggregations-metrics-sum-aggregation.html[`sum` aggregation] on the
`taxless_total_price` field, the
{ref}/search-aggregations-metrics-max-aggregation.html[`max` aggregation] on the
`total_quantity` field, and the
{ref}/search-aggregations-metrics-cardinality-aggregation.html[`cardinality` aggregation]
on the `order_id` field:
[role="screenshot"]
image::images/ecommerce-pivot2.jpg["Adding multiple aggregations to a {dataframe-transform} in {kib}"]
TIP: If you're interested in a subset of the data, you can optionally include a
{ref}/search-request-body.html#request-body-search-query[query] element. In this
example, we've filtered the data so that we're only looking at orders with a
`currency` of `EUR`. Alternatively, we could group the data by that field too.
If you want to use more complex queries, you can create your {dataframe} from a
{kibana-ref}/save-open-search.html[saved search].
If you prefer, you can use the
{ref}/preview-data-frame-transform.html[preview {dataframe-transforms} API]:
[source,js]
--------------------------------------------------
POST _data_frame/transforms/_preview
{
"source": {
"index": "kibana_sample_data_ecommerce",
"query": {
"bool": {
"filter": {
"term": {"currency": "EUR"}
}
}
}
},
"pivot": {
"group_by": {
"customer_id": {
"terms": {
"field": "customer_id"
}
}
},
"aggregations": {
"total_quantity.sum": {
"sum": {
"field": "total_quantity"
}
},
"taxless_total_price.sum": {
"sum": {
"field": "taxless_total_price"
}
},
"total_quantity.max": {
"max": {
"field": "total_quantity"
}
},
"order_id.cardinality": {
"cardinality": {
"field": "order_id"
}
}
}
}
}
--------------------------------------------------
// CONSOLE
// TEST[skip:set up sample data]
--
. When you are satisfied with what you see in the preview, create the
{dataframe-transform}.
+
--
.. Supply a job ID and the name of the target (or _destination_) index.
.. Decide whether you want the {dataframe-transform} to run once or continuously.
--
+
--
Since this sample data index is unchanging, let's use the default behavior and
just run the {dataframe-transform} once.
[role="screenshot"]
image::images/ecommerce-batch.jpg["Specifying the {dataframe-transform} options in {kib}"]
If you want to try it out, however, go ahead and click on *Continuous mode*.
You must choose a field that the {dataframe-transform} can use to check which
entities have changed. In general, it's a good idea to use the ingest timestamp
field. In this example, however, you can use the `order_date` field.
If you prefer, you can use the
{ref}/put-data-frame-transform.html[create {dataframe-transforms} API]. For
example:
[source,js]
--------------------------------------------------
PUT _data_frame/transforms/ecommerce-customer-transform
{
"source": {
"index": [
"kibana_sample_data_ecommerce"
],
"query": {
"bool": {
"filter": {
"term": {
"currency": "EUR"
}
}
}
}
},
"pivot": {
"group_by": {
"customer_id": {
"terms": {
"field": "customer_id"
}
}
},
"aggregations": {
"total_quantity.sum": {
"sum": {
"field": "total_quantity"
}
},
"taxless_total_price.sum": {
"sum": {
"field": "taxless_total_price"
}
},
"total_quantity.max": {
"max": {
"field": "total_quantity"
}
},
"order_id.cardinality": {
"cardinality": {
"field": "order_id"
}
}
}
},
"dest": {
"index": "ecommerce-customers"
}
}
--------------------------------------------------
// CONSOLE
// TEST[skip:setup kibana sample data]
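
If you choose continuous mode instead, the configuration additionally contains a
`sync` object that identifies the field used to check for changed entities, and
typically a `frequency`. A sketch of that addition, using the `order_date` field
mentioned above (the values are illustrative, not recommendations):

[source,js]
--------------------------------------------------
"frequency": "5m",
"sync": {
  "time": {
    "field": "order_date",
    "delay": "60s"
  }
}
--------------------------------------------------
// NOTCONSOLE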
--
. Start the {dataframe-transform}.
+
--
TIP: Even though resource utilization is automatically adjusted based on the
cluster load, a {dataframe-transform} increases search and indexing load on your
cluster while it runs. If you're experiencing an excessive load, however, you
can stop it.
You can start, stop, and manage {dataframe-transforms} in {kib}:
[role="screenshot"]
image::images/dataframe-transforms.jpg["Managing {dataframe-transforms} in {kib}"]
Alternatively, you can use the
{ref}/start-data-frame-transform.html[start {dataframe-transforms}] and
{ref}/stop-data-frame-transform.html[stop {dataframe-transforms}] APIs. For
example:
[source,js]
--------------------------------------------------
POST _data_frame/transforms/ecommerce-customer-transform/_start
--------------------------------------------------
// CONSOLE
// TEST[skip:setup kibana sample data]
--
. Explore the data in your new index.
+
--
For example, use the *Discover* application in {kib}:
[role="screenshot"]
image::images/ecommerce-results.jpg["Exploring the new index in {kib}"]
--
TIP: If you do not want to keep the {dataframe-transform}, you can delete it in
{kib} or use the
{ref}/delete-data-frame-transform.html[delete {dataframe-transform} API]. When
you delete a {dataframe-transform}, its destination index and {kib} index
patterns remain.
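
For example, assuming the transform created in this tutorial has been stopped,
the following request deletes it; the `ecommerce-customers` destination index and
any {kib} index pattern remain:

[source,js]
--------------------------------------------------
DELETE _data_frame/transforms/ecommerce-customer-transform
--------------------------------------------------
// NOTCONSOLE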

7 binary image files added (not shown)

@@ -0,0 +1,82 @@
[role="xpack"]
[[ml-dataframes]]
= {dataframe-transforms-cap}
[partintro]
--
beta[]
{es} aggregations are a powerful and flexible feature that enable you to
summarize and retrieve complex insights about your data. You can summarize
complex things like the number of web requests per day on a busy website, broken
down by geography and browser type. If you use the same data set to try to
calculate something as simple as a single number for the average duration of
visitor web sessions, however, you can quickly run out of memory.
Why does this occur? A web session duration is an example of a behavioral
attribute not held on any one log record; it has to be derived by finding the
first and last records for each session in our weblogs. This derivation requires
some complex query expressions and a lot of memory to connect all the data
points. If you have an ongoing background process that fuses related events from
one index into entity-centric summaries in another index, you get a more useful,
joined-up picture--this is essentially what _{dataframes}_ are.
[discrete]
[[ml-dataframes-usage]]
== When to use {dataframes}
You might want to consider using {dataframes} instead of aggregations when:
* You need a complete _feature index_ rather than a top-N set of items.
+
In {ml}, you often need a complete set of behavioral features rather than just the
top-N. For example, if you are predicting customer churn, you might look at
features such as the number of website visits in the last week, the total number
of sales, or the number of emails sent. The {stack} {ml-features} create models
based on this multi-dimensional feature space, so they benefit from full feature
indices ({dataframes}).
+
This scenario also applies when you are trying to search across the results of
an aggregation or multiple aggregations. Aggregation results can be ordered or
filtered, but there are
{ref}/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-order[limitations to ordering]
and
{ref}/search-aggregations-pipeline-bucket-selector-aggregation.html[filtering by bucket selector]
is constrained by the maximum number of buckets returned. If you want to search
all aggregation results, you need to create the complete {dataframe}. If you
need to sort or filter the aggregation results by multiple fields, {dataframes}
are particularly useful.
* You need to sort aggregation results by a pipeline aggregation.
+
{ref}/search-aggregations-pipeline.html[Pipeline aggregations] cannot be used
for sorting. Technically, this is because pipeline aggregations are run during
the reduce phase after all other aggregations have already completed. If you
create a {dataframe}, you can effectively perform multiple passes over the data.
* You want to create summary tables to optimize queries.
+
For example, if you have a high-level dashboard that is accessed by a large
number of users and it
uses a complex aggregation over a large dataset, it may be more efficient to
create a {dataframe} to cache results. Thus, each user doesn't need to run the
aggregation query.
Though there are multiple ways to create {dataframes}, this content pertains
to one specific method: _{dataframe-transforms}_.
* <<ml-transform-overview>>
* <<df-api-quickref>>
* <<dataframe-examples>>
* <<dataframe-troubleshooting>>
* <<dataframe-limitations>>
--
include::overview.asciidoc[]
include::checkpoints.asciidoc[]
include::api-quickref.asciidoc[]
include::dataframe-examples.asciidoc[]
include::troubleshooting.asciidoc[]
include::limitations.asciidoc[]

@@ -0,0 +1,219 @@
[role="xpack"]
[[dataframe-limitations]]
== {dataframe-transform-cap} limitations
[subs="attributes"]
++++
<titleabbrev>Limitations</titleabbrev>
++++
beta[]
The following limitations and known problems apply to the 7.4 release of
the Elastic {dataframe} feature:
[float]
[[df-compatibility-limitations]]
=== Beta {dataframe-transforms} do not have guaranteed backwards or forwards compatibility
While {dataframe-transforms} are in beta, it is not guaranteed that a
{dataframe-transform} created in a previous version of the {stack} will be able
to start and operate in a future version. Nor can support be provided for
{dataframe-transform} tasks operating in a cluster with mixed node versions.
Please note that the output of a {dataframe-transform} is persisted to a
destination index. This is a normal {es} index and is not affected by the beta
status.
[float]
[[df-ui-limitation]]
=== {dataframe-cap} UI will not work during a rolling upgrade from 7.2
If your cluster contains mixed version nodes, for example during a rolling
upgrade from 7.2 to a newer version, and {dataframe-transforms} have been
created in 7.2, the {dataframe} UI will not work. Please wait until all nodes
have been upgraded to the newer version before using the {dataframe} UI.
[float]
[[df-datatype-limitations]]
=== {dataframe-cap} data type limitation
{dataframes-cap} do not (yet) support fields containing arrays in the UI or
the API. If you try to create one, the UI will fail to show the source index
table.
[float]
[[df-ccs-limitations]]
=== {ccs-cap} is not supported
{ccs-cap} is not supported for {dataframe-transforms}.
[float]
[[df-kibana-limitations]]
=== Up to 1,000 {dataframe-transforms} are supported
A single cluster will support up to 1,000 {dataframe-transforms}.
When using the
{ref}/get-data-frame-transform.html[GET {dataframe-transforms} API], a total
`count` of transforms is returned. Use the `size` and `from` parameters to
enumerate through the full list.
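
For example, a request along these lines pages through the transforms 100 at a
time:

[source,js]
----
GET _data_frame/transforms?from=100&size=100
----
// NOTCONSOLE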
[float]
[[df-aggresponse-limitations]]
=== Aggregation responses may be incompatible with destination index mappings
When a {dataframe-transform} is first started, it will deduce the mappings
required for the destination index. This process is based on the field types of
the source index and the aggregations used. If the fields are derived from
{ref}/search-aggregations-metrics-scripted-metric-aggregation.html[`scripted_metrics`]
or {ref}/search-aggregations-pipeline-bucket-script-aggregation.html[`bucket_scripts`],
{ref}/dynamic-mapping.html[dynamic mappings] will be used. In some instances the
deduced mappings may be incompatible with the actual data. For example, numeric
overflows might occur or dynamically mapped fields might contain both numbers
and strings. Please check {es} logs if you think this may have occurred. As a
workaround, you may define custom mappings prior to starting the
{dataframe-transform}. For example,
{ref}/indices-create-index.html[create a custom destination index] or
{ref}/indices-templates.html[define an index template].
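
For example, a sketch of creating a destination index with explicit mappings
before starting the {dataframe-transform}; the index and field names are
illustrative:

[source,js]
----
PUT my-dest-index
{
  "mappings": {
    "properties": {
      "customer_id": { "type": "keyword" },
      "total_order_amt": { "type": "double" }
    }
  }
}
----
// NOTCONSOLE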
[float]
[[df-batch-limitations]]
=== Batch {dataframe-transforms} may not account for changed documents
A batch {dataframe-transform} uses a
{ref}/search-aggregations-bucket-composite-aggregation.html[composite aggregation]
which allows efficient pagination through all buckets. Composite aggregations
do not yet support a search context, therefore if the source data is changed
(deleted, updated, added) while the batch {dataframe} is in progress, then the
results may not include these changes.
[float]
[[df-consistency-limitations]]
=== {cdataframe-cap} consistency does not account for deleted or updated documents
While the process for {cdataframe-transforms} allows the continual recalculation
of the {dataframe-transform} as new data is being ingested, it also has some
limitations.
Changed entities are identified only if their time field has also been updated
and the update falls within the range of the check for changes. This is designed
in principle for, and suited to, the use case where new data is given a
timestamp at the time of ingest.
If the indices that fall within the scope of the source index pattern are
removed, for example when deleting historical time-based indices, then the
composite aggregation performed in consecutive checkpoint processing will search
over different source data, and entities that only existed in the deleted index
will not be removed from the {dataframe} destination index.
Depending on your use case, you may wish to recreate the {dataframe-transform}
entirely after deletions. Alternatively, if your use case is tolerant to
historical archiving, you may wish to include a max ingest timestamp in your
aggregation. This will allow you to exclude results that have not been recently
updated when viewing the {dataframe} destination index.
[float]
[[df-deletion-limitations]]
=== Deleting a {dataframe-transform} does not delete the {dataframe} destination index or {kib} index pattern
When you delete a {dataframe-transform} using `DELETE _data_frame/transforms/index`,
neither the {dataframe} destination index nor the {kib} index pattern, should
one have been created, is deleted. These objects must be deleted separately.
[float]
[[df-aggregation-page-limitations]]
=== Handling dynamic adjustment of aggregation page size
During the development of {dataframe-transforms}, control was favoured over
performance: it is preferable for the {dataframe-transform} to take longer to
complete quietly in the background rather than to finish quickly and take
precedence in resource consumption.
Composite aggregations are well suited to high-cardinality data, enabling
pagination through the results. If a {ref}/circuit-breaker.html[circuit breaker]
memory exception occurs while performing the composite aggregation search, the
search is retried with a reduced number of buckets. This circuit breaker is
calculated based upon all activity within the cluster, not just activity from
{dataframe-transforms}, so it may only be a temporary resource availability
issue.
For a batch {dataframe-transform}, the number of buckets requested is only ever
adjusted downwards. Lowering the value may result in a longer duration for the
transform checkpoint to complete. For {cdataframes}, the number of
buckets requested is reset back to its default at the start of every checkpoint
and it is possible for circuit breaker exceptions to occur repeatedly in the
{es} logs.
The {dataframe-transform} retrieves data in batches, which means it calculates
several buckets at once. By default, this is 500 buckets per search/index
operation. The default can be changed using `max_page_search_size`; the minimum
value is 10. If failures still occur once the number of buckets requested has
been reduced to its minimum, the {dataframe-transform} is set to a failed state.
[float]
[[df-dynamic-adjustments-limitations]]
=== Handling dynamic adjustments for many terms
For each checkpoint, entities are identified that have changed since the last
time the check was performed. This list of changed entities is supplied as a
{ref}/query-dsl-terms-query.html[terms query] to the {dataframe-transform}
composite aggregation, one page at a time. Then updates are applied to the
destination index for each page of entities.
The page `size` is defined by `max_page_search_size` which is also used to
define the number of buckets returned by the composite aggregation search. The
default value is 500, the minimum is 10.
The index setting
{ref}/index-modules.html#dynamic-index-settings[`index.max_terms_count`] defines
the maximum number of terms that can be used in a terms query. The default value
is 65536. If `max_page_search_size` exceeds `index.max_terms_count` the
transform will fail.
Using smaller values for `max_page_search_size` may result in a longer duration
for the transform checkpoint to complete.
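
As a sketch, assuming the 7.4 configuration layout where `max_page_search_size`
is part of the `pivot` object (the value shown is illustrative):

[source,js]
----
"pivot": {
  "max_page_search_size": 250,
  "group_by": { ... },
  "aggregations": { ... }
}
----
// NOTCONSOLE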
[float]
[[df-scheduling-limitations]]
=== {cdataframe-cap} scheduling limitations
A {cdataframe} periodically checks for changes to source data. The functionality
of the scheduler is currently limited to a basic periodic timer whose
`frequency` can range from 1s to 1h. The default is 1m. This is designed to run
little and often. When choosing a `frequency` for this timer, consider your
ingest rate along with the impact that the {dataframe-transform} search/index
operations have on other users in your cluster. Also note that retries occur at
the `frequency` interval.
[float]
[[df-failed-limitations]]
=== Handling of failed {dataframe-transforms}
Failed {dataframe-transforms} remain as persistent tasks and should be handled
appropriately, either by deleting them or by resolving the root cause of the
failure and restarting them.
When using the API to delete a failed {dataframe-transform}, first stop it using
`_stop?force=true`, then delete it.
To start a failed {dataframe-transform} after the root cause has been resolved,
you must specify the `_start?force=true` parameter.
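
For example, using a hypothetical transform ID, a failed transform can be
force-stopped and then deleted:

[source,js]
----
POST _data_frame/transforms/my-failed-transform/_stop?force=true
DELETE _data_frame/transforms/my-failed-transform
----
// NOTCONSOLE

Alternatively, after resolving the root cause, restart it with
`POST _data_frame/transforms/my-failed-transform/_start?force=true`.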
[float]
[[df-availability-limitations]]
=== {cdataframes-cap} may give incorrect results if documents are not yet available to search
After a document is indexed, there is a very small delay until it is available
to search.
A {cdataframe-transform} periodically checks for changed entities between the
time it last checked and `now` minus `sync.time.delay`. This time window moves
without overlapping. If the timestamp of a recently indexed document falls
within this time window but the document is not yet available to search, the
entity is not updated.
If you use a `sync.time.field` that represents the data ingest time and a
zero-second or very small `sync.time.delay`, it is more likely that this issue
will occur.

@@ -0,0 +1,71 @@
[role="xpack"]
[[ml-transform-overview]]
== {dataframe-transform-cap} overview
++++
<titleabbrev>Overview</titleabbrev>
++++
beta[]
A _{dataframe}_ is a two-dimensional tabular data structure. In the context of
the {stack}, it is a transformation of data that is indexed in {es}. For
example, you can use {dataframes} to _pivot_ your data into a new entity-centric
index. By transforming and summarizing your data, it becomes possible to
visualize and analyze it in alternative and interesting ways.
A lot of {es} indices are organized as a stream of events: each event is an
individual document, for example a single item purchase. {dataframes-cap} enable
you to summarize this data, bringing it into an organized, more
analysis-friendly format. For example, you can summarize all the purchases of a
single customer.
You can create {dataframes} by using {dataframe-transforms}.
{dataframe-transforms-cap} enable you to define a pivot, which is a set of
features that transform the index into a different, more digestible format.
Pivoting results in a summary of your data, which is the {dataframe}.
To define a pivot, first you select one or more fields that you will use to
group your data. You can select categorical fields (terms) and numerical fields
for grouping. If you use numerical fields, the field values are bucketed using
an interval that you specify.
The second step is deciding how you want to aggregate the grouped data. When
using aggregations, you effectively ask questions about the index. There are
different types of aggregations, each with its own purpose and output. To learn
more about the supported aggregations and group-by fields, see
{ref}/data-frame-transform-resource.html[{dataframe-transform-cap} resources].
As an optional step, you can also add a query to further limit the scope of the
aggregation.
The {dataframe-transform} performs a composite aggregation that
paginates through all the data defined by the source index query. The output of
the aggregation is stored in a destination index. Each time the
{dataframe-transform} queries the source index, it creates a _checkpoint_. You
can decide whether you want the {dataframe-transform} to run once (batch
{dataframe-transform}) or continuously ({cdataframe-transform}). A batch
{dataframe-transform} is a single operation that has a single checkpoint.
{cdataframe-transforms-cap} continually increment and process checkpoints as new
source data is ingested.
.Example
Imagine that you run a webshop that sells clothes. Every order creates a document
that contains a unique order ID, the name and the category of the ordered product,
its price, the ordered quantity, the exact date of the order, and some customer
information (name, gender, location, etc). Your dataset contains all the transactions
from last year.
If you want to check the sales in the different categories in your last fiscal
year, define a {dataframe-transform} that groups the data by the product
categories (women's shoes, men's clothing, etc.) and the order date. Use the
last year as the interval for the order date. Then add a sum aggregation on the
ordered quantity. The result is a {dataframe} that shows the number of sold
items in every product category in the last year.
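
A pivot along these lines could be previewed with a request such as the
following. The index and field names are hypothetical and depend on how the
order documents are mapped, and for simplicity the date range is applied as a
source query rather than as a second group-by:

[source,js]
----
POST _data_frame/transforms/_preview
{
  "source": {
    "index": "clothes-orders",
    "query": { "range": { "order_date": { "gte": "now-1y/d" } } }
  },
  "pivot": {
    "group_by": {
      "category": { "terms": { "field": "category" } }
    },
    "aggregations": {
      "quantity.sum": { "sum": { "field": "quantity" } }
    }
  }
}
----
// NOTCONSOLE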
[role="screenshot"]
image::images/ml-dataframepivot.jpg["Example of a data frame pivot in {kib}"]
IMPORTANT: The {dataframe-transform} leaves your source index intact. It
creates a new index that is dedicated to the {dataframe}.

@@ -0,0 +1,29 @@
[[dataframe-troubleshooting]]
== Troubleshooting {dataframe-transforms}
[subs="attributes"]
++++
<titleabbrev>Troubleshooting</titleabbrev>
++++
Use the information in this section to troubleshoot common problems.
include::{stack-repo-dir}/help.asciidoc[tag=get-help]
If you encounter problems with your {dataframe-transforms}, you can gather more
information from the following files and APIs:
* Lightweight audit messages are stored in `.data-frame-notifications-*`. Search
by your `transform_id`.
* The
{ref}/get-data-frame-transform-stats.html[get {dataframe-transform} statistics API]
provides information about the transform status and failures.
* If the {dataframe-transform} exists as a task, you can use the
{ref}/tasks.html[task management API] to gather task information. For example:
`GET _tasks?actions=data_frame/transforms*&detailed`. Typically, the task exists
when the transform is in a started or failed state.
* The {es} logs from the node that was running the {dataframe-transform} might
also contain useful information. You can identify the node from the notification
messages. Alternatively, if the task still exists, you can get that information
from the get {dataframe-transform} statistics API. For more information, see
{ref}/logging.html[Logging configuration].
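
For example, the following request searches the audit messages (described in the
first item above) for a hypothetical transform ID:

[source,js]
----
GET .data-frame-notifications-*/_search
{
  "query": {
    "term": { "transform_id": "my-transform" }
  }
}
----
// NOTCONSOLE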