From 536e1001253229ad18577a91ec23cc1f314dee33 Mon Sep 17 00:00:00 2001
From: debadair
Date: Wed, 28 Oct 2020 16:39:35 -0700
Subject: [PATCH] [DOCS] Add top-level Data management section. (#64185) (#64322)

* [DOCS] Add top-level Data management section.

* Edits

* Edits

* Fixed xrefs

* Apply suggestions from code review

Co-authored-by: Andrei Dan
Co-authored-by: Lee Hinman

* Update docs/reference/datatiers.asciidoc

* Update docs/reference/datatiers.asciidoc

Co-authored-by: Andrei Dan
Co-authored-by: Lee Hinman

Co-authored-by: Andrei Dan
Co-authored-by: Lee Hinman
---
 docs/reference/data-management.asciidoc |  33 ++++++
 docs/reference/datatiers.asciidoc       | 150 +++++++++++++-----------
 docs/reference/index.asciidoc           |   4 +-
 3 files changed, 116 insertions(+), 71 deletions(-)
 create mode 100644 docs/reference/data-management.asciidoc

diff --git a/docs/reference/data-management.asciidoc b/docs/reference/data-management.asciidoc
new file mode 100644
index 00000000000..10f9f155f2d
--- /dev/null
+++ b/docs/reference/data-management.asciidoc
@@ -0,0 +1,33 @@
+[role="xpack"]
+[[data-management]]
+= Data management
+
+[partintro]
+--
+The data you store in {es} generally falls into one of two categories:
+
+* Content: a collection of items you want to search, such as a catalog of products
+* Time series data: a stream of continuously-generated timestamped data, such as log entries
+
+Content might be frequently updated,
+but the value of the content remains relatively constant over time.
+You want to be able to retrieve items quickly regardless of how old they are.
+
+Time series data keeps accumulating over time, so you need strategies for
+balancing the value of the data against the cost of storing it.
+As it ages, it tends to become less important and less-frequently accessed,
+so you can move it to less expensive, less performant hardware.
+For your oldest data, what matters is that you have access to the data.
+It's ok if queries take longer to complete.
+
+To help you manage your data, {es} enables you to:
+
+* Define <> of data nodes with different performance characteristics.
+* Automatically transition indices through the data tiers according to your performance needs and retention policies
+with <> ({ilm-init}).
+* Leverage <> stored in a remote repository to provide resiliency
+for your older indices while reducing operating costs and maintaining search performance.
+* Perform <> of data stored on less-performant hardware.
+--
+
+include::datatiers.asciidoc[]
diff --git a/docs/reference/datatiers.asciidoc b/docs/reference/datatiers.asciidoc
index c84fc9a6642..69d4990c3bf 100644
--- a/docs/reference/datatiers.asciidoc
+++ b/docs/reference/datatiers.asciidoc
@@ -1,100 +1,112 @@
 [role="xpack"]
 [[data-tiers]]
-=== Data tiers
+== Data tiers

-Common data lifecycle management patterns revolve around transitioning indices
-through multiple collections of nodes with different hardware characteristics in order
-to fulfil evolving CRUD, search, and aggregation needs as indices age. The concept
-of a tiered hardware architecture is not new in {es}.
-<> is instrumental in
-implementing tiered architectures by automating the managemnt of indices according to
-performance, resiliency and data retention requirements.
-<> architectures are common
-for timeseries data such as logging and metrics.
+A _data tier_ is a collection of nodes with the same data role that
+typically share the same hardware profile:

-A data tier is a collection of nodes with the same role. Data tiers are an integrated
-solution offering better support for optimising cost and improving performance.
-Formalized data tiers in ES allow configuration of the lifecycle and location of data
-in a hot/warm/cold topology without requiring the use of custom node attributes.
-Each tier formalises specific characteristics and data behaviours.
+* <> nodes handle the indexing and query load for content such as a product catalog.
+* <> nodes handle the indexing load for time series data such as logs or metrics
+and hold your most recent, most-frequently-accessed data.
+* <> nodes hold time series data that is accessed less-frequently
+and rarely needs to be updated.
+* <> nodes hold time series data that is accessed occasionally and not normally updated.

-The node roles that can currently define data tiers are:
+When you index documents directly to a specific index, they remain on content tier nodes indefinitely.

-* <>
-* <>
-* <>
-* <>
+When you index documents to a data stream, they initially reside on hot tier nodes.
+You can configure <> ({ilm-init}) policies
+to automatically transition your time series data through the hot, warm, and cold tiers
+according to your performance, resiliency and data retention requirements.

-The more generic <> is not a data tier role, but
-it is the default node role if no roles are configured. If a node has the
-<> role we treat the node as if it has all of the tier
-roles assigned.
+A node's <> is configured in `elasticsearch.yml`.
+For example, the highest-performance nodes in a cluster might be assigned to both the hot and content tiers:

+[source,yaml]
+--------------------------------------------------
+node.roles: ["data_hot", "data_content"]
+--------------------------------------------------
+
+[discrete]
 [[content-tier]]
-==== Content tier
+=== Content tier

-The content tier is made of one or more nodes that have the <>
-role. A content tier is designed to store and search user created content. Non-timeseries data
-doesn't necessarily follow the hot-warm-cold path. The hardware profiles are quite different to
-the <>. User created content prioritises high CPU to support complex
-queries and aggregations in a timely manner, as opposed to the <> which
-prioritises high IO.
-The content data has very long data retention characteristics and from a resiliency perspective
-the indices in this tier should be configured to use one or more replicas.
+Data stored in the content tier is generally a collection of items such as a product catalog or article archive.
+Unlike time series data, the value of the content remains relatively constant over time,
+so it doesn't make sense to move it to a tier with different performance characteristics as it ages.
+Data stored in the content tier is generally a collection of items such as a product catalog or article archive. +Unlike time series data, the value of the content remains relatively constant over time, +so it doesn't make sense to move it to a tier with different performance characteristics as it ages. +Content data typically has long data retention requirements, and you want to be able to retrieve +items quickly regardless of how old they are. -NOTE: new indices that are not part of <> will be automatically allocated to the -<> +Content tier nodes are usually optimized for query performance--they prioritize processing power over IO throughput +so they can process complex searches and aggregations and return results quickly. +While they are also responsible for indexing, content data is generally not ingested at as high a rate +as time series data such as logs and metrics. From a resiliency perspective the indices in this +tier should be configured to use one or more replicas. +New indices are automatically allocated to the <> unless they are part of a data stream. + +[discrete] [[hot-tier]] -==== Hot tier +=== Hot tier -The hot tier is made of one or more nodes that have the <> role. -It is the {es} entry point for timeseries data. This tier needs to be fast both for reads -and writes, requiring more hardware resources such as SSD drives. The hot tier is usually -hosting the data from recent days. From a resiliency perspective the indices in this -tier should be configured to use one or more replicas. +The hot tier is the {es} entry point for time series data and holds your most-recent, +most-frequently-searched time series data. +Nodes in the hot tier need to be fast for both reads and writes, +which requires more hardware resources and faster storage (SSDs). +For resiliency, indices in the hot tier should be configured to use one or more replicas. 

-NOTE: new indices that are part of a <> will be automatically allocated to the
-<>
+New indices that are part of a <> are automatically allocated to the
+hot tier.

+[discrete]
 [[warm-tier]]
-==== Warm tier
+=== Warm tier

-The warm tier is made of one or more nodes that have the <> role.
-This tier is where data goes once it is not queried as frequently as in the <>.
-It is a medium-fast tier that still allows data updates. The warm tier is usually
-hosting the data from recent weeks. From a resiliency perspective the indices in this
-tier should be configured to use one or more replicas.
+Time series data can move to the warm tier once it is being queried less frequently
+than the recently-indexed data in the hot tier.
+The warm tier typically holds data from recent weeks.
+Updates are still allowed, but likely infrequent.
+Nodes in the warm tier generally don't need to be as fast as those in the hot tier.
+For resiliency, indices in the warm tier should be configured to use one or more replicas.

+[discrete]
 [[cold-tier]]
-==== Cold tier
+=== Cold tier

-The cold tier is made of one or more nodes that have the <> role.
-Once the data in the <> is not updated anymore it can transition to the
-cold tier. The cold tier is still a responsive query tier but as the data transitions into this
-tier it can be compressed, shrunken, or configured to have zero replicas and be backed by
-a <>. The cold tier is usually hosting the data from recent
-months or years.
+Once data in the warm tier is no longer being updated, it can move to the cold tier.
+The cold tier typically holds the data from recent months or years.
+The cold tier is still a responsive query tier, but data in the cold tier is not normally updated.
+As data transitions into the cold tier it can be compressed and shrunken.
+For resiliency, indices in the cold tier can rely on
+<>, eliminating the need for replicas.

 [discrete]
 [[data-tier-allocation]]
 === Data tier index allocation

-When an index is created {es} will automatically allocate the index to the <>
-if the index is not part of a <> or to the <> if the index
-is part of a <>.
-{es} will configure the <>
-to `data_content` or `data_hot` respectively.
+When you create an index, by default {es} sets
+<>
+to `data_content` to automatically allocate the index shards to the content tier.

-These heuristics can be overridden by specifying any <>
+When {es} creates an index as part of a <>,
+by default {es} sets
+<>
+to `data_hot` to automatically allocate the index shards to the hot tier.
+
+You can override the automatic tier-based allocation by specifying
+<>
 settings in the create index request or index template that matches the new index.
-Specifying any configuration, including `null`, for `index.routing.allocation.include._tier_preference` will
-also opt out of the automatic new index allocation to tiers.
+
+You can also explicitly set `index.routing.allocation.include._tier_preference`
+to opt out of the default tier-based allocation.
+If you set the tier preference to `null`, {es} ignores the data tier roles during allocation.
+
 [discrete]
 [[data-tier-migration]]
-=== Data tier index migration
+=== Automatic data tier migration

-<> automates the transition of managed
-indices through the available data tiers using the `migrate` action which is injected
-in every phase, unless it's manually specified in the phase or an
-<> modifying the allocation rules is manually configured.
+{ilm-init} automatically transitions managed
+indices through the available data tiers using the <> action.
+By default, this action is automatically injected in every phase.
+You can explicitly specify the migrate action to override the default behavior,
+or use the <> to manually specify allocation rules.
diff --git a/docs/reference/index.asciidoc b/docs/reference/index.asciidoc
index 9bf8470e049..1e39636c319 100644
--- a/docs/reference/index.asciidoc
+++ b/docs/reference/index.asciidoc
@@ -30,8 +30,6 @@ include::indices/index-templates.asciidoc[]

 include::data-streams/data-streams.asciidoc[]

-include::datatiers.asciidoc[]
-
 include::ingest.asciidoc[]

 include::search/search-your-data/search-your-data.asciidoc[]
@@ -46,6 +44,8 @@ include::sql/index.asciidoc[]

 include::scripting.asciidoc[]

+include::data-management.asciidoc[]
+
 include::ilm/index.asciidoc[]

 ifdef::permanently-unreleased-branch[]