[DOCS] Add top-level Data management section. (#64185) (#64322)

* [DOCS] Add top-level Data management section.

* Edits

* Edits

* Fixed xrefs

* Apply suggestions from code review

Co-authored-by: Andrei Dan <andrei.dan@elastic.co>
Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>

* Update docs/reference/datatiers.asciidoc

* Update docs/reference/datatiers.asciidoc

Co-authored-by: Andrei Dan <andrei.dan@elastic.co>
Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>

Co-authored-by: Andrei Dan <andrei.dan@elastic.co>
Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>
debadair 2020-10-28 16:39:35 -07:00 committed by GitHub
parent 9aabf3a50d
commit 536e100125
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 116 additions and 71 deletions

@ -0,0 +1,33 @@
[role="xpack"]
[[data-management]]
= Data management
[partintro]
--
The data you store in {es} generally falls into one of two categories:
* Content: a collection of items you want to search, such as a catalog of products
* Time series data: a stream of continuously-generated timestamped data, such as log entries
Content might be frequently updated,
but the value of the content remains relatively constant over time.
You want to be able to retrieve items quickly regardless of how old they are.
Time series data keeps accumulating over time, so you need strategies for
balancing the value of the data against the cost of storing it.
As it ages, it tends to become less important and less frequently accessed,
so you can move it to less expensive, less performant hardware.
For your oldest data, what matters is that you have access to the data.
It's OK if queries take longer to complete.
To help you manage your data, {es} enables you to:
* Define <<data-tiers, multiple tiers>> of data nodes with different performance characteristics.
* Automatically transition indices through the data tiers according to your performance needs and retention policies
with <<index-lifecycle-management, {ilm}>> ({ilm-init}).
* Leverage <<searchable-snapshots, searchable snapshots>> stored in a remote repository to provide resiliency
for your older indices while reducing operating costs and maintaining search performance.
* Perform <<async-search-intro, asynchronous searches>> of data stored on less-performant hardware.
--
include::datatiers.asciidoc[]

@ -1,100 +1,112 @@
[role="xpack"]
[[data-tiers]]
=== Data tiers
== Data tiers
Common data lifecycle management patterns revolve around transitioning indices
through multiple collections of nodes with different hardware characteristics in order
to fulfil evolving CRUD, search, and aggregation needs as indices age. The concept
of a tiered hardware architecture is not new in {es}.
<<index-lifecycle-management, Index Lifecycle Management>> is instrumental in
implementing tiered architectures by automating the management of indices according to
performance, resiliency and data retention requirements.
<<overview-index-lifecycle-management, Hot/warm/cold>> architectures are common
for timeseries data such as logging and metrics.
A _data tier_ is a collection of nodes with the same data role that
typically share the same hardware profile:
A data tier is a collection of nodes with the same role. Data tiers are an integrated
solution offering better support for optimising cost and improving performance.
Formalized data tiers in ES allow configuration of the lifecycle and location of data
in a hot/warm/cold topology without requiring the use of custom node attributes.
Each tier formalises specific characteristics and data behaviours.
* <<content-tier, Content tier>> nodes handle the indexing and query load for content such as a product catalog.
* <<hot-tier, Hot tier>> nodes handle the indexing load for time series data such as logs or metrics
and hold your most recent, most-frequently-accessed data.
* <<warm-tier, Warm tier>> nodes hold time series data that is accessed less frequently
and rarely needs to be updated.
* <<cold-tier, Cold tier>> nodes hold time series data that is accessed occasionally and not normally updated.
The node roles that can currently define data tiers are:
When you index documents directly to a specific index, they remain on content tier nodes indefinitely.
* <<data-content-node, data_content>>
* <<data-hot-node, data_hot>>
* <<data-warm-node, data_warm>>
* <<data-cold-node, data_cold>>
When you index documents to a data stream, they initially reside on hot tier nodes.
You can configure <<index-lifecycle-management, {ilm}>> ({ilm-init}) policies
to automatically transition your time series data through the hot, warm, and cold tiers
according to your performance, resiliency and data retention requirements.
The more generic <<data-node, data role>> is not a data tier role, but
it is the default node role if no roles are configured. If a node has the
<<data-node, data>> role we treat the node as if it has all of the tier
roles assigned.
A node's <<data-node, data role>> is configured in `elasticsearch.yml`.
For example, the highest-performance nodes in a cluster might be assigned to both the hot and content tiers:
[source,yaml]
--------------------------------------------------
node.roles: ["data_hot", "data_content"]
--------------------------------------------------
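As a minimal sketch, assuming a cluster that dedicates separate nodes to the older tiers (this layout is illustrative, not a recommendation), a node intended to hold only warm data could be assigned just the `data_warm` role:

[source,yaml]
--------------------------------------------------
# elasticsearch.yml on a hypothetical dedicated warm-tier node.
# With only this role, the node belongs to the warm tier and to
# none of the hot, cold, or content tiers.
node.roles: ["data_warm"]
--------------------------------------------------

A dedicated cold-tier node would list `data_cold` in the same way.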
[discrete]
[[content-tier]]
==== Content tier
=== Content tier
The content tier is made of one or more nodes that have the <<data-content-node, data_content>>
role. A content tier is designed to store and search user created content. Non-timeseries data
doesn't necessarily follow the hot-warm-cold path. The hardware profiles are quite different to
the <<hot-tier, hot tier>>. User created content prioritises high CPU to support complex
queries and aggregations in a timely manner, as opposed to the <<hot-tier, hot tier>> which
prioritises high IO.
The content data has very long data retention characteristics and from a resiliency perspective
the indices in this tier should be configured to use one or more replicas.
Data stored in the content tier is generally a collection of items such as a product catalog or article archive.
Unlike time series data, the value of the content remains relatively constant over time,
so it doesn't make sense to move it to a tier with different performance characteristics as it ages.
Content data typically has long data retention requirements, and you want to be able to retrieve
items quickly regardless of how old they are.
NOTE: new indices that are not part of <<data-streams, data streams>> will be automatically allocated to the
<<content-tier>>
Content tier nodes are usually optimized for query performance--they prioritize processing power over IO throughput
so they can process complex searches and aggregations and return results quickly.
While they are also responsible for indexing, content data is generally not ingested at as high a rate
as time series data such as logs and metrics. From a resiliency perspective the indices in this
tier should be configured to use one or more replicas.
New indices are automatically allocated to the <<content-tier>> unless they are part of a data stream.
[discrete]
[[hot-tier]]
==== Hot tier
=== Hot tier
The hot tier is made of one or more nodes that have the <<data-hot-node, data_hot>> role.
It is the {es} entry point for timeseries data. This tier needs to be fast both for reads
and writes, requiring more hardware resources such as SSD drives. The hot tier is usually
hosting the data from recent days. From a resiliency perspective the indices in this
tier should be configured to use one or more replicas.
The hot tier is the {es} entry point for time series data and holds your most-recent,
most-frequently-searched time series data.
Nodes in the hot tier need to be fast for both reads and writes,
which requires more hardware resources and faster storage (SSDs).
For resiliency, indices in the hot tier should be configured to use one or more replicas.
NOTE: new indices that are part of a <<data-streams, data stream>> will be automatically allocated to the
<<hot-tier>>
New indices that are part of a <<data-streams, data stream>> are automatically allocated to the
hot tier.
[discrete]
[[warm-tier]]
==== Warm tier
=== Warm tier
The warm tier is made of one or more nodes that have the <<data-warm-node, data_warm>> role.
This tier is where data goes once it is not queried as frequently as in the <<hot-tier, hot tier>>.
It is a medium-fast tier that still allows data updates. The warm tier is usually
hosting the data from recent weeks. From a resiliency perspective the indices in this
tier should be configured to use one or more replicas.
Time series data can move to the warm tier once it is being queried less frequently
than the recently-indexed data in the hot tier.
The warm tier typically holds data from recent weeks.
Updates are still allowed, but likely infrequent.
Nodes in the warm tier generally don't need to be as fast as those in the hot tier.
For resiliency, indices in the warm tier should be configured to use one or more replicas.
[discrete]
[[cold-tier]]
==== Cold tier
=== Cold tier
The cold tier is made of one or more nodes that have the <<data-cold-node, data_cold>> role.
Once the data in the <<warm-tier, warm tier>> is not updated anymore it can transition to the
cold tier. The cold tier is still a responsive query tier but as the data transitions into this
tier it can be compressed, shrunken, or configured to have zero replicas and be backed by
a <<ilm-searchable-snapshot, snapshot>>. The cold tier is usually hosting the data from recent
months or years.
Once data in the warm tier is no longer being updated, it can move to the cold tier.
The cold tier typically holds the data from recent months or years.
The cold tier is still a responsive query tier, but data in the cold tier is not normally updated.
As data transitions into the cold tier it can be compressed and shrunken.
For resiliency, indices in the cold tier can rely on
<<ilm-searchable-snapshot, searchable snapshots>>, eliminating the need for replicas.
[discrete]
[[data-tier-allocation]]
=== Data tier index allocation
When an index is created {es} will automatically allocate the index to the <<content-tier, Content tier>>
if the index is not part of a <<data-streams, data stream>> or to the <<hot-tier, Hot tier>> if the index
is part of a <<data-streams, data stream>>.
{es} will configure the <<tier-preference-allocation-filter, `index.routing.allocation.include._tier_preference`>>
to `data_content` or `data_hot` respectively.
When you create an index, by default {es} sets
<<tier-preference-allocation-filter, `index.routing.allocation.include._tier_preference`>>
to `data_content` to automatically allocate the index shards to the content tier.
These heuristics can be overridden by specifying any <<shard-allocation-filtering, shard allocation filtering>>
When {es} creates an index as part of a <<data-streams, data stream>>,
by default {es} sets
<<tier-preference-allocation-filter, `index.routing.allocation.include._tier_preference`>>
to `data_hot` to automatically allocate the index shards to the hot tier.
You can override the automatic tier-based allocation by specifying
<<shard-allocation-filtering, shard allocation filtering>>
settings in the create index request or index template that matches the new index.
Specifying any configuration, including `null`, for `index.routing.allocation.include._tier_preference` will
also opt out of the automatic new index allocation to tiers.
You can also explicitly set `index.routing.allocation.include._tier_preference`
to opt out of the default tier-based allocation.
If you set the tier preference to `null`, {es} ignores the data tier roles during allocation.
[discrete]
[[data-tier-migration]]
=== Data tier index migration
=== Automatic data tier migration
<<index-lifecycle-management, Index Lifecycle Management>> automates the transition of managed
indices through the available data tiers using the `migrate` action which is injected
in every phase, unless it's manually specified in the phase or an
<<ilm-allocate-action, allocate action>> modifying the allocation rules is manually configured.
{ilm-init} automatically transitions managed
indices through the available data tiers using the <<ilm-migrate-action, migrate>> action.
By default, this action is automatically injected in every phase.
You can explicitly specify the migrate action to override the default behavior,
or use the <<ilm-allocate-action, allocate action>> to manually specify allocation rules.

@ -30,8 +30,6 @@ include::indices/index-templates.asciidoc[]
include::data-streams/data-streams.asciidoc[]
include::datatiers.asciidoc[]
include::ingest.asciidoc[]
include::search/search-your-data/search-your-data.asciidoc[]
@ -46,6 +44,8 @@ include::sql/index.asciidoc[]
include::scripting.asciidoc[]
include::data-management.asciidoc[]
include::ilm/index.asciidoc[]
ifdef::permanently-unreleased-branch[]