[DOCS] Add top-level Data management section. (#64185) (#64322)

* [DOCS] Add top-level Data management section.

* Edits

* Edits

* Fixed xrefs

* Apply suggestions from code review

Co-authored-by: Andrei Dan <andrei.dan@elastic.co>
Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>

* Update docs/reference/datatiers.asciidoc

* Update docs/reference/datatiers.asciidoc

Co-authored-by: Andrei Dan <andrei.dan@elastic.co>
Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>

Co-authored-by: Andrei Dan <andrei.dan@elastic.co>
Co-authored-by: Lee Hinman <dakrone@users.noreply.github.com>
debadair 2020-10-28 16:39:35 -07:00 committed by GitHub
parent 9aabf3a50d
commit 536e100125
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 116 additions and 71 deletions

@@ -0,0 +1,33 @@
+[role="xpack"]
+[[data-management]]
+= Data management
+[partintro]
+--
+The data you store in {es} generally falls into one of two categories:
+* Content: a collection of items you want to search, such as a catalog of products
+* Time series data: a stream of continuously-generated timestamped data, such as log entries
+Content might be frequently updated,
+but the value of the content remains relatively constant over time.
+You want to be able to retrieve items quickly regardless of how old they are.
+Time series data keeps accumulating over time, so you need strategies for
+balancing the value of the data against the cost of storing it.
+As it ages, it tends to become less important and less-frequently accessed,
+so you can move it to less expensive, less performant hardware.
+For your oldest data, what matters is that you have access to the data.
+It's ok if queries take longer to complete.
+To help you manage your data, {es} enables you to:
+* Define <<data-tiers, multiple tiers>> of data nodes with different performance characteristics.
+* Automatically transition indices through the data tiers according to your performance needs and retention policies
+with <<index-lifecycle-management, {ilm}>> ({ilm-init}).
+* Leverage <<searchable-snapshots, searchable snapshots>> stored in a remote repository to provide resiliency
+for your older indices while reducing operating costs and maintaining search performance.
+* Perform <<async-search-intro, asynchronous searches>> of data stored on less-performant hardware.
+--
+include::datatiers.asciidoc[]

@@ -1,100 +1,112 @@
 [role="xpack"]
 [[data-tiers]]
-=== Data tiers
+== Data tiers
-Common data lifecycle management patterns revolve around transitioning indices
-through multiple collections of nodes with different hardware characteristics in order
-to fulfil evolving CRUD, search, and aggregation needs as indices age. The concept
-of a tiered hardware architecture is not new in {es}.
-<<index-lifecycle-management, Index Lifecycle Management>> is instrumental in
-implementing tiered architectures by automating the managemnt of indices according to
-performance, resiliency and data retention requirements.
-<<overview-index-lifecycle-management, Hot/warm/cold>> architectures are common
-for timeseries data such as logging and metrics.
-A data tier is a collection of nodes with the same role. Data tiers are an integrated
-solution offering better support for optimising cost and improving performance.
-Formalized data tiers in ES allow configuration of the lifecycle and location of data
-in a hot/warm/cold topology without requiring the use of custom node attributes.
-Each tier formalises specific characteristics and data behaviours.
-The node roles that can currently define data tiers are:
-* <<data-content-node, data_content>>
-* <<data-hot-node, data_hot>>
-* <<data-warm-node, data_warm>>
-* <<data-cold-node, data_cold>>
-The more generic <<data-node, data role>> is not a data tier role, but
-it is the default node role if no roles are configured. If a node has the
-<<data-node, data>> role we treat the node as if it has all of the tier
-roles assigned.
+A _data tier_ is a collection of nodes with the same data role that
+typically share the same hardware profile:
+* <<content-tier, Content tier>> nodes handle the indexing and query load for content such as a product catalog.
+* <<hot-tier, Hot tier>> nodes handle the indexing load for time series data such as logs or metrics
+and hold your most recent, most-frequently-accessed data.
+* <<warm-tier, Warm tier>> nodes hold time series data that is accessed less-frequently
+and rarely needs to be updated.
+* <<cold-tier, Cold tier>> nodes hold time series data that is accessed occasionally and not normally updated.
+When you index documents directly to a specific index, they remain on content tier nodes indefinitely.
+When you index documents to a data stream, they initially reside on hot tier nodes.
+You can configure <<index-lifecycle-management, {ilm}>> ({ilm-init}) policies
+to automatically transition your time series data through the hot, warm, and cold tiers
+according to your performance, resiliency and data retention requirements.
+A node's <<data-node, data role>> is configured in `elasticsearch.yml`.
+For example, the highest-performance nodes in a cluster might be assigned to both the hot and content tiers:
+[source,yaml]
+--------------------------------------------------
+node.roles: ["data_hot", "data_content"]
+--------------------------------------------------
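As a rough illustration of how the tier roles in this hunk map onto `node.roles` lines, the sketch below renders one line per node. The node names, the `layout` mapping, and the `render_node_roles` helper are hypothetical; only the role names `data_content`, `data_hot`, `data_warm`, and `data_cold` come from the documentation text.

```python
import json

# Tier role names taken from the documentation change above.
TIER_ROLES = {
    "content": "data_content",
    "hot": "data_hot",
    "warm": "data_warm",
    "cold": "data_cold",
}


def render_node_roles(tiers):
    """Return the `node.roles` line for elasticsearch.yml for the given tiers."""
    roles = [TIER_ROLES[tier] for tier in tiers]  # raises KeyError on an unknown tier
    return "node.roles: " + json.dumps(roles)


# Hypothetical three-node layout; the node names are placeholders.
layout = {
    "fast-node": ["hot", "content"],
    "mid-node": ["warm"],
    "bulk-node": ["cold"],
}

for name, tiers in layout.items():
    print(f"{name}: {render_node_roles(tiers)}")
```

The first entry reproduces the hot-plus-content example from the hunk; the other two show nodes dedicated to a single tier.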
+[discrete]
 [[content-tier]]
-==== Content tier
+=== Content tier
-The content tier is made of one or more nodes that have the <<data-content-node, data_content>>
-role. A content tier is designed to store and search user created content. Non-timeseries data
-doesn't necessarily follow the hot-warm-cold path. The hardware profiles are quite different to
-the <<hot-tier, hot tier>>. User created content prioritises high CPU to support complex
-queries and aggregations in a timely manner, as opposed to the <<hot-tier, hot tier>> which
-prioritises high IO.
-The content data has very long data retention characteristics and from a resiliency perspective
-the indices in this tier should be configured to use one or more replicas.
-NOTE: new indices that are not part of <<data-streams, data streams>> will be automatically allocated to the
-<<content-tier>>
+Data stored in the content tier is generally a collection of items such as a product catalog or article archive.
+Unlike time series data, the value of the content remains relatively constant over time,
+so it doesn't make sense to move it to a tier with different performance characteristics as it ages.
+Content data typically has long data retention requirements, and you want to be able to retrieve
+items quickly regardless of how old they are.
+Content tier nodes are usually optimized for query performance--they prioritize processing power over IO throughput
+so they can process complex searches and aggregations and return results quickly.
+While they are also responsible for indexing, content data is generally not ingested at as high a rate
+as time series data such as logs and metrics. From a resiliency perspective the indices in this
+tier should be configured to use one or more replicas.
+New indices are automatically allocated to the <<content-tier>> unless they are part of a data stream.
+[discrete]
 [[hot-tier]]
-==== Hot tier
+=== Hot tier
-The hot tier is made of one or more nodes that have the <<data-hot-node, data_hot>> role.
-It is the {es} entry point for timeseries data. This tier needs to be fast both for reads
-and writes, requiring more hardware resources such as SSD drives. The hot tier is usually
-hosting the data from recent days. From a resiliency perspective the indices in this
-tier should be configured to use one or more replicas.
-NOTE: new indices that are part of a <<data-streams, data stream>> will be automatically allocated to the
-<<hot-tier>>
+The hot tier is the {es} entry point for time series data and holds your most-recent,
+most-frequently-searched time series data.
+Nodes in the hot tier need to be fast for both reads and writes,
+which requires more hardware resources and faster storage (SSDs).
+For resiliency, indices in the hot tier should be configured to use one or more replicas.
+New indices that are part of a <<data-streams, data stream>> are automatically allocated to the
+hot tier.
+[discrete]
 [[warm-tier]]
-==== Warm tier
+=== Warm tier
-The warm tier is made of one or more nodes that have the <<data-warm-node, data_warm>> role.
-This tier is where data goes once it is not queried as frequently as in the <<hot-tier, hot tier>>.
-It is a medium-fast tier that still allows data updates. The warm tier is usually
-hosting the data from recent weeks. From a resiliency perspective the indices in this
-tier should be configured to use one or more replicas.
+Time series data can move to the warm tier once it is being queried less frequently
+than the recently-indexed data in the hot tier.
+The warm tier typically holds data from recent weeks.
+Updates are still allowed, but likely infrequent.
+Nodes in the warm tier generally don't need to be as fast as those in the hot tier.
+For resiliency, indices in the warm tier should be configured to use one or more replicas.
+[discrete]
 [[cold-tier]]
-==== Cold tier
+=== Cold tier
-The cold tier is made of one or more nodes that have the <<data-cold-node, data_cold>> role.
-Once the data in the <<warm-tier, warm tier>> is not updated anymore it can transition to the
-cold tier. The cold tier is still a responsive query tier but as the data transitions into this
-tier it can be compressed, shrunken, or configured to have zero replicas and be backed by
-a <<ilm-searchable-snapshot, snapshot>>. The cold tier is usually hosting the data from recent
-months or years.
+Once data in the warm tier is no longer being updated, it can move to the cold tier.
+The cold tier typically holds the data from recent months or years.
+The cold tier is still a responsive query tier, but data in the cold tier is not normally updated.
+As data transitions into the cold tier it can be compressed and shrunken.
+For resiliency, indices in the cold tier can rely on
+<<ilm-searchable-snapshot, searchable snapshots>>, eliminating the need for replicas.
 [discrete]
 [[data-tier-allocation]]
 === Data tier index allocation
-When an index is created {es} will automatically allocate the index to the <<content-tier, Content tier>>
-if the index is not part of a <<data-streams, data stream>> or to the <<hot-tier, Hot tier>> if the index
-is part of a <<data-streams, data stream>>.
-{es} will configure the <<tier-preference-allocation-filter, `index.routing.allocation.include._tier_preference`>>
-to `data_content` or `data_hot` respectively.
-These heuristics can be overridden by specifying any <<shard-allocation-filtering, shard allocation filtering>>
+When you create an index, by default {es} sets
+<<tier-preference-allocation-filter, `index.routing.allocation.include._tier_preference`>>
+to `data_content` to automatically allocate the index shards to the content tier.
+When {es} creates an index as part of a <<data-streams, data stream>>,
+by default {es} sets
+<<tier-preference-allocation-filter, `index.routing.allocation.include._tier_preference`>>
+to `data_hot` to automatically allocate the index shards to the hot tier.
+You can override the automatic tier-based allocation by specifying
+<<shard-allocation-filtering, shard allocation filtering>>
 settings in the create index request or index template that matches the new index.
-Specifying any configuration, including `null`, for `index.routing.allocation.include._tier_preference` will
-also opt out of the automatic new index allocation to tiers.
+You can also explicitly set `index.routing.allocation.include._tier_preference`
+to opt out of the default tier-based allocation.
+If you set the tier preference to `null`, {es} ignores the data tier roles during allocation.
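The default described in this section can be modelled in a few lines. This is a minimal sketch of the rule as the text states it, not of how {es} implements it: the helper names and the `_UNSET` sentinel are hypothetical, and the sketch ignores the other shard allocation filtering settings that the text says can also override the default. In practice {es} applies the default itself, so a request only needs the setting when it wants something else.

```python
import json

# Setting name quoted in the documentation change above.
TIER_PREFERENCE = "index.routing.allocation.include._tier_preference"

# Sentinel meaning "the request did not mention the setting at all".
_UNSET = object()


def effective_tier_preference(in_data_stream, requested=_UNSET):
    """Model the tier preference an index ends up with.

    in_data_stream: True for an index created as part of a data stream.
    requested: explicit value from the create index request or a matching
        index template; None stands for JSON null.
    """
    if requested is not _UNSET:
        return requested  # any explicit value, including null, opts out of the default
    return "data_hot" if in_data_stream else "data_content"


def settings_fragment(in_data_stream, requested=_UNSET):
    """Render the modelled setting as a JSON settings fragment."""
    value = effective_tier_preference(in_data_stream, requested)
    return json.dumps({"settings": {TIER_PREFERENCE: value}})


print(settings_fragment(False))              # regular index
print(settings_fragment(True))               # index in a data stream
print(settings_fragment(True, "data_warm"))  # explicit override
print(settings_fragment(False, None))        # explicit null
```

The sentinel keeps "not specified" separate from an explicit `null`, which matters because the text gives `null` its own meaning: the data tier roles are ignored during allocation.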
 [discrete]
 [[data-tier-migration]]
-=== Data tier index migration
+=== Automatic data tier migration
-<<index-lifecycle-management, Index Lifecycle Management>> automates the transition of managed
-indices through the available data tiers using the `migrate` action which is injected
-in every phase, unless it's manually specified in the phase or an
-<<ilm-allocate-action, allocate action>> modifying the allocation rules is manually configured.
+{ilm-init} automatically transitions managed
+indices through the available data tiers using the <<ilm-migrate-action, migrate>> action.
+By default, this action is automatically injected in every phase.
+You can explicitly specify the migrate action to override the default behavior,
+or use the <<ilm-allocate-action, allocate action>> to manually specify allocation rules.
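One way to read the injection rule in this section is sketched below: migrate is injected into a phase unless the phase names migrate itself or has an allocate action that sets allocation rules. This is a simplified reading of the text, not the {ilm-init} implementation. The `migrate_is_injected` helper, the phase ages, and the `box_type` attribute are hypothetical, and the `enabled` option on migrate and the `include`, `exclude`, `require`, and `number_of_replicas` options on allocate are assumptions that the text above does not state.

```python
import json

# Allocate options treated as "allocation rules" in this sketch (an assumption).
ALLOCATION_RULE_KEYS = ("include", "exclude", "require")


def migrate_is_injected(actions):
    """Return True if this sketch expects migrate to be injected into the phase."""
    if "migrate" in actions:
        return False  # specified explicitly, so nothing is injected
    allocate = actions.get("allocate", {})
    return not any(key in allocate for key in ALLOCATION_RULE_KEYS)


# Hypothetical policy body; ages, attribute names, and values are placeholders.
policy = {
    "policy": {
        "phases": {
            "hot": {"actions": {"rollover": {"max_age": "7d"}}},
            "warm": {
                "min_age": "30d",
                "actions": {"allocate": {"number_of_replicas": 1}},
            },
            "cold": {
                "min_age": "90d",
                "actions": {
                    "migrate": {"enabled": False},
                    "allocate": {"require": {"box_type": "archive"}},
                },
            },
        }
    }
}

for name, phase in policy["policy"]["phases"].items():
    print(name, migrate_is_injected(phase["actions"]))

print(json.dumps(policy, indent=2))
```

In this reading, the hot and warm phases rely on the injected action, while the cold phase specifies migrate explicitly and supplies its own allocation rule.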

@@ -30,8 +30,6 @@ include::indices/index-templates.asciidoc[]
 include::data-streams/data-streams.asciidoc[]
-include::datatiers.asciidoc[]
 include::ingest.asciidoc[]
 include::search/search-your-data/search-your-data.asciidoc[]
@@ -46,6 +44,8 @@ include::sql/index.asciidoc[]
 include::scripting.asciidoc[]
+include::data-management.asciidoc[]
 include::ilm/index.asciidoc[]
 ifdef::permanently-unreleased-branch[]