hadoop/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html

750 lines
47 KiB
HTML

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--
| Generated by Apache Maven Doxia at 2023-02-28
| Rendered using Apache Maven Stylus Skin 1.5
-->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Apache Hadoop 3.4.0-SNAPSHOT &#x2013; HDFS Erasure Coding</title>
<style type="text/css" media="all">
@import url("./css/maven-base.css");
@import url("./css/maven-theme.css");
@import url("./css/site.css");
</style>
<link rel="stylesheet" href="./css/print.css" type="text/css" media="print" />
<meta name="Date-Revision-yyyymmdd" content="20230228" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body class="composite">
<div id="banner">
<a href="http://hadoop.apache.org/" id="bannerLeft">
<img src="http://hadoop.apache.org/images/hadoop-logo.jpg" alt="" />
</a>
<a href="http://www.apache.org/" id="bannerRight">
<img src="http://www.apache.org/images/asf_logo_wide.png" alt="" />
</a>
<div class="clear">
<hr/>
</div>
</div>
<div id="breadcrumbs">
<div class="xright"> <a href="http://wiki.apache.org/hadoop" class="externalLink">Wiki</a>
|
<a href="https://gitbox.apache.org/repos/asf/hadoop.git" class="externalLink">git</a>
|
<a href="http://hadoop.apache.org/" class="externalLink">Apache Hadoop</a>
&nbsp;| Last Published: 2023-02-28
&nbsp;| Version: 3.4.0-SNAPSHOT
</div>
<div class="clear">
<hr/>
</div>
</div>
<div id="leftColumn">
<div id="navcolumn">
<h5>General</h5>
<ul>
<li class="none">
<a href="../../index.html">Overview</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/SingleCluster.html">Single Node Setup</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/ClusterSetup.html">Cluster Setup</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/CommandsManual.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/FileSystemShell.html">FileSystem Shell</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/Compatibility.html">Compatibility Specification</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/DownstreamDev.html">Downstream Developer's Guide</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/AdminCompatibilityGuide.html">Admin Compatibility Guide</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/InterfaceClassification.html">Interface Classification</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/filesystem/index.html">FileSystem Specification</a>
</li>
</ul>
<h5>Common</h5>
<ul>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/CLIMiniCluster.html">CLI Mini Cluster</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/FairCallQueue.html">Fair Call Queue</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/NativeLibraries.html">Native Libraries</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/Superusers.html">Proxy User</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/RackAwareness.html">Rack Awareness</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/SecureMode.html">Secure Mode</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/ServiceLevelAuth.html">Service Level Authorization</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/HttpAuthentication.html">HTTP Authentication</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/CredentialProviderAPI.html">Credential Provider API</a>
</li>
<li class="none">
<a href="../../hadoop-kms/index.html">Hadoop KMS</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/Tracing.html">Tracing</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/UnixShellGuide.html">Unix Shell Guide</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/registry/index.html">Registry</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/AsyncProfilerServlet.html">Async Profiler</a>
</li>
</ul>
<h5>HDFS</h5>
<ul>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsDesign.html">Architecture</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html">User Guide</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HDFSCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html">NameNode HA With QJM</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html">NameNode HA With NFS</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ObserverNameNode.html">Observer NameNode</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/Federation.html">Federation</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ViewFs.html">ViewFs</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ViewFsOverloadScheme.html">ViewFsOverloadScheme</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html">Snapshots</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsEditsViewer.html">Edits Viewer</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html">Image Viewer</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html">Permissions and HDFS</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html">Quotas and HDFS</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/LibHdfs.html">libhdfs (C API)</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/WebHDFS.html">WebHDFS (REST API)</a>
</li>
<li class="none">
<a href="../../hadoop-hdfs-httpfs/index.html">HttpFS</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html">Short Circuit Local Reads</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html">Centralized Cache Management</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html">NFS Gateway</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html">Rolling Upgrade</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ExtendedAttributes.html">Extended Attributes</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html">Transparent Encryption</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html">Multihoming</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html">Storage Policies</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/MemoryStorage.html">Memory Storage Support</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/SLGUserGuide.html">Synthetic Load Generator</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html">Erasure Coding</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html">Disk Balancer</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsUpgradeDomain.html">Upgrade Domain</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsDataNodeAdminGuide.html">DataNode Admin</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs-rbf/HDFSRouterFederation.html">Router Federation</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsProvidedStorage.html">Provided Storage</a>
</li>
</ul>
<h5>MapReduce</h5>
<ul>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html">Tutorial</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html">Compatibility with 1.x</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html">Encrypted Shuffle</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html">Pluggable Shuffle/Sort</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html">Distributed Cache Deploy</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/SharedCacheSupport.html">Support for YARN Shared Cache</a>
</li>
</ul>
<h5>MapReduce REST APIs</h5>
<ul>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredAppMasterRest.html">MR Application Master</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html">MR History Server</a>
</li>
</ul>
<h5>YARN</h5>
<ul>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/YARN.html">Architecture</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/YarnCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html">Capacity Scheduler</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/FairScheduler.html">Fair Scheduler</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html">ResourceManager Restart</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html">ResourceManager HA</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/ResourceModel.html">Resource Model</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/NodeLabel.html">Node Labels</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/NodeAttributes.html">Node Attributes</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html">Web Application Proxy</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/TimelineServer.html">Timeline Server</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html">Timeline Service V.2</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html">Writing YARN Applications</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/YarnApplicationSecurity.html">YARN Application Security</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/NodeManager.html">NodeManager</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/DockerContainers.html">Running Applications in Docker Containers</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/RuncContainers.html">Running Applications in runC Containers</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html">Using CGroups</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/SecureContainer.html">Secure Containers</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/ReservationSystem.html">Reservation System</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/GracefulDecommission.html">Graceful Decommission</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/OpportunisticContainers.html">Opportunistic Containers</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/Federation.html">YARN Federation</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/SharedCache.html">Shared Cache</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/UsingGpus.html">Using GPU</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/UsingFPGA.html">Using FPGA</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/PlacementConstraints.html">Placement Constraints</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/YarnUI2.html">YARN UI2</a>
</li>
</ul>
<h5>YARN REST APIs</h5>
<ul>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html">Introduction</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html">Resource Manager</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/NodeManagerRest.html">Node Manager</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/TimelineServer.html#Timeline_Server_REST_API_v1">Timeline Server</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html#Timeline_Service_v.2_REST_API">Timeline Service V.2</a>
</li>
</ul>
<h5>YARN Service</h5>
<ul>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/Overview.html">Overview</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/QuickStart.html">QuickStart</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/Concepts.html">Concepts</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/YarnServiceAPI.html">Yarn Service API</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/ServiceDiscovery.html">Service Discovery</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/SystemServices.html">System Services</a>
</li>
</ul>
<h5>Hadoop Compatible File Systems</h5>
<ul>
<li class="none">
<a href="../../hadoop-aliyun/tools/hadoop-aliyun/index.html">Aliyun OSS</a>
</li>
<li class="none">
<a href="../../hadoop-aws/tools/hadoop-aws/index.html">Amazon S3</a>
</li>
<li class="none">
<a href="../../hadoop-azure/index.html">Azure Blob Storage</a>
</li>
<li class="none">
<a href="../../hadoop-azure-datalake/index.html">Azure Data Lake Storage</a>
</li>
<li class="none">
<a href="../../hadoop-cos/cloud-storage/index.html">Tencent COS</a>
</li>
<li class="none">
<a href="../../hadoop-huaweicloud/cloud-storage/index.html">Huaweicloud OBS</a>
</li>
</ul>
<h5>Auth</h5>
<ul>
<li class="none">
<a href="../../hadoop-auth/index.html">Overview</a>
</li>
<li class="none">
<a href="../../hadoop-auth/Examples.html">Examples</a>
</li>
<li class="none">
<a href="../../hadoop-auth/Configuration.html">Configuration</a>
</li>
<li class="none">
<a href="../../hadoop-auth/BuildingIt.html">Building</a>
</li>
</ul>
<h5>Tools</h5>
<ul>
<li class="none">
<a href="../../hadoop-streaming/HadoopStreaming.html">Hadoop Streaming</a>
</li>
<li class="none">
<a href="../../hadoop-archives/HadoopArchives.html">Hadoop Archives</a>
</li>
<li class="none">
<a href="../../hadoop-archive-logs/HadoopArchiveLogs.html">Hadoop Archive Logs</a>
</li>
<li class="none">
<a href="../../hadoop-distcp/DistCp.html">DistCp</a>
</li>
<li class="none">
<a href="../../hadoop-federation-balance/HDFSFederationBalance.html">HDFS Federation Balance</a>
</li>
<li class="none">
<a href="../../hadoop-gridmix/GridMix.html">GridMix</a>
</li>
<li class="none">
<a href="../../hadoop-rumen/Rumen.html">Rumen</a>
</li>
<li class="none">
<a href="../../hadoop-resourceestimator/ResourceEstimator.html">Resource Estimator Service</a>
</li>
<li class="none">
<a href="../../hadoop-sls/SchedulerLoadSimulator.html">Scheduler Load Simulator</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/Benchmarking.html">Hadoop Benchmarking</a>
</li>
<li class="none">
<a href="../../hadoop-dynamometer/Dynamometer.html">Dynamometer</a>
</li>
</ul>
<h5>Reference</h5>
<ul>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/release/">Changelog and Release Notes</a>
</li>
<li class="none">
<a href="../../api/index.html">Java API docs</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/UnixShellAPI.html">Unix Shell API</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/Metrics.html">Metrics</a>
</li>
</ul>
<h5>Configuration</h5>
<ul>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/core-default.xml">core-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/hdfs-default.xml">hdfs-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs-rbf/hdfs-rbf-default.xml">hdfs-rbf-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml">mapred-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-common/yarn-default.xml">yarn-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-kms/kms-default.html">kms-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-hdfs-httpfs/httpfs-default.html">httpfs-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/DeprecatedProperties.html">Deprecated Properties</a>
</li>
</ul>
<a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy">
<img alt="Built by Maven" src="./images/logos/maven-feather.png"/>
</a>
</div>
</div>
<div id="bodyColumn">
<div id="contentBox">
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<h1>HDFS Erasure Coding</h1>
<ul>
<li><a href="#Purpose">Purpose</a></li>
<li><a href="#Background">Background</a></li>
<li><a href="#Architecture">Architecture</a></li>
<li><a href="#Deployment">Deployment</a>
<ul>
<li><a href="#Cluster_and_hardware_configuration">Cluster and hardware configuration</a></li>
<li><a href="#Configuration_keys">Configuration keys</a></li>
<li><a href="#Enable_Intel_ISA-L">Enable Intel ISA-L</a></li>
<li><a href="#Administrative_commands">Administrative commands</a></li></ul></li>
<li><a href="#Limitations">Limitations</a></li></ul>
<section>
<h2><a name="Purpose"></a>Purpose</h2>
<p>Replication is expensive &#x2013; the default 3x replication scheme in HDFS has 200% overhead in storage space and other resources (e.g., network bandwidth). However, for warm and cold datasets with relatively low I/O activities, additional block replicas are rarely accessed during normal operations, but still consume the same amount of resources as the first replica.</p>
<p>Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication, which provides the same level of fault-tolerance with much less storage space. In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%. Replication factor of an EC file is meaningless. It is always 1 and cannot be changed via -setrep command.</p></section><section>
<h2><a name="Background"></a>Background</h2>
<p>In storage systems, the most notable usage of EC is Redundant Array of Inexpensive Disks (RAID). RAID implements EC through striping, which divides logically sequential data (such as a file) into smaller units (such as bit, byte, or block) and stores consecutive units on different disks. In the rest of this guide this unit of striping distribution is termed a striping cell (or cell). For each stripe of original data cells, a certain number of parity cells are calculated and stored &#x2013; the process of which is called encoding. The error on any striping cell can be recovered through decoding calculation based on surviving data and parity cells.</p>
<p>Integrating EC with HDFS can improve storage efficiency while still providing similar data durability as traditional replication-based HDFS deployments. As an example, a 3x replicated file with 6 blocks will consume 6*3 = 18 blocks of disk space. But with EC (6 data, 3 parity) deployment, it will only consume 9 blocks of disk space.</p></section><section>
<h2><a name="Architecture"></a>Architecture</h2>
<p>In the context of EC, striping has several critical advantages. First, it enables online EC (writing data immediately in EC format), avoiding a conversion phase and immediately saving storage space. Online EC also enhances sequential I/O performance by leveraging multiple disk spindles in parallel; this is especially desirable in clusters with high end networking. Second, it naturally distributes a small file to multiple DataNodes and eliminates the need to bundle multiple files into a single coding group. This greatly simplifies file operations such as deletion, quota reporting, and migration between federated namespaces.</p>
<p>In typical HDFS clusters, small files can account for over 3/4 of total storage consumption. To better support small files, in this first phase of work HDFS supports EC with striping. In the future, HDFS will also support a contiguous EC layout. See the design doc and discussion on <a class="externalLink" href="https://issues.apache.org/jira/browse/HDFS-7285">HDFS-7285</a> for more information.</p>
<ul>
<li>
<p><b>NameNode Extensions</b> - Striped HDFS files are logically composed of block groups, each of which contains a certain number of internal blocks. To reduce NameNode memory consumption from these additional blocks, a new hierarchical block naming protocol was introduced. The ID of a block group can be inferred from the ID of any of its internal blocks. This allows management at the level of the block group rather than the block.</p>
</li>
<li>
<p><b>Client Extensions</b> - The client read and write paths were enhanced to work on multiple internal blocks in a block group in parallel. On the output / write path, DFSStripedOutputStream manages a set of data streamers, one for each DataNode storing an internal block in the current block group. The streamers mostly work asynchronously. A coordinator takes charge of operations on the entire block group, including ending the current block group, allocating a new block group, and so forth. On the input / read path, DFSStripedInputStream translates a requested logical byte range of data as ranges into internal blocks stored on DataNodes. It then issues read requests in parallel. Upon failures, it issues additional read requests for decoding.</p>
</li>
<li>
<p><b>DataNode Extensions</b> - The DataNode runs an additional ErasureCodingWorker (ECWorker) task for background recovery of failed erasure coded blocks. Failed EC blocks are detected by the NameNode, which then chooses a DataNode to do the recovery work. The recovery task is passed as a heartbeat response. This process is similar to how replicated blocks are re-replicated on failure. Reconstruction performs three key tasks:</p>
<ol style="list-style-type: decimal">
<li>
<p><i>Read the data from source nodes:</i> Input data is read in parallel from source nodes using a dedicated thread pool. Based on the EC policy, it schedules the read requests to all source targets and reads only the minimum number of input blocks for reconstruction.</p>
</li>
<li>
<p><i>Decode the data and generate the output data:</i> New data and parity blocks are decoded from the input data. All missing data and parity blocks are decoded together.</p>
</li>
<li>
<p><i>Transfer the generated data blocks to target nodes:</i> Once decoding is finished, the recovered blocks are transferred to target DataNodes.</p>
</li>
</ol>
</li>
<li>
<p><b>Erasure coding policies</b> To accommodate heterogeneous workloads, we allow files and directories in an HDFS cluster to have different replication and erasure coding policies. The erasure coding policy encapsulates how to encode/decode a file. Each policy is defined by the following pieces of information:</p>
<ol style="list-style-type: decimal">
<li>
<p><i>The EC schema:</i> This includes the numbers of data and parity blocks in an EC group (e.g., 6+3), as well as the codec algorithm (e.g., Reed-Solomon, XOR).</p>
</li>
<li>
<p><i>The size of a striping cell.</i> This determines the granularity of striped reads and writes, including buffer sizes and encoding work.</p>
</li>
</ol>
<p>Policies are named <i>codec</i>-<i>num data blocks</i>-<i>num parity blocks</i>-<i>cell size</i>. Currently, five built-in policies are supported: <code>RS-3-2-1024k</code>, <code>RS-6-3-1024k</code>, <code>RS-10-4-1024k</code>, <code>RS-LEGACY-6-3-1024k</code>, <code>XOR-2-1-1024k</code>.</p>
<p>The default <code>REPLICATION</code> scheme is also supported. It can only be set on directory, to force the directory to adopt 3x replication scheme, instead of inheriting its ancestor&#x2019;s erasure coding policy. This policy makes it possible to interleave 3x replication scheme directory with erasure coding directory.</p>
<p><code>REPLICATION</code> is always enabled. Out of all the EC policies, RS(6,3) is enabled by default.</p>
<p>Similar to HDFS storage policies, erasure coding policies are set on a directory. When a file is created, it inherits the EC policy of its nearest ancestor directory.</p>
<p>Directory-level EC policies only affect new files created within the directory. Once a file has been created, its erasure coding policy can be queried but not changed. If an erasure coded file is renamed to a directory with a different EC policy, the file retains its existing EC policy. Converting a file to a different EC policy requires rewriting its data; do this by copying the file (e.g. via distcp) rather than renaming it.</p>
<p>We allow users to define their own EC policies via an XML file, which must have the following three parts:</p>
<ol style="list-style-type: decimal">
<li>
<p><i>layoutversion:</i> This indicates the version of EC policy XML file format.</p>
</li>
<li>
<p><i>schemas:</i> This includes all the user defined EC schemas.</p>
</li>
<li>
<p><i>policies:</i> This includes all the user defined EC policies, and each policy consists of schema id and the size of a striping cell (cellsize).</p>
</li>
</ol>
<p>A sample EC policy XML file named user_ec_policies.xml.template is in the Hadoop conf directory, which user can reference.</p>
</li>
<li>
<p><b>Intel ISA-L</b> Intel ISA-L stands for Intel Intelligent Storage Acceleration Library. ISA-L is an open-source collection of optimized low-level functions designed for storage applications. It includes fast block Reed-Solomon type erasure codes optimized for Intel AVX and AVX2 instruction sets. HDFS erasure coding can leverage ISA-L to accelerate encoding and decoding calculation. ISA-L supports most major operating systems, including Linux and Windows. ISA-L is not enabled by default. See the instructions below for how to enable ISA-L.</p>
</li>
</ul></section><section>
<h2><a name="Deployment"></a>Deployment</h2><section>
<h3><a name="Cluster_and_hardware_configuration"></a>Cluster and hardware configuration</h3>
<p>Erasure coding places additional demands on the cluster in terms of CPU and network.</p>
<p>Encoding and decoding work consumes additional CPU on both HDFS clients and DataNodes.</p>
<p>Erasure coding requires a minimum of as many DataNodes in the cluster as the configured EC stripe width. For EC policy RS (6,3), this means a minimum of 9 DataNodes.</p>
<p>Erasure coded files are also spread across racks for rack fault-tolerance. This means that when reading and writing striped files, most operations are off-rack. Network bisection bandwidth is thus very important.</p>
<p>For rack fault-tolerance, it is also important to have enough number of racks, so that on average, each rack holds number of blocks no more than the number of EC parity blocks. A formula to calculate this would be (data blocks + parity blocks) / parity blocks, rounding up. For EC policy RS (6,3), this means minimally 3 racks (calculated by (6 + 3) / 3 = 3), and ideally 9 or more to handle planned and unplanned outages. For clusters with fewer racks than the number of the parity cells, HDFS cannot maintain rack fault-tolerance, but will still attempt to spread a striped file across multiple nodes to preserve node-level fault-tolerance. For this reason, it is recommended to setup racks with similar number of DataNodes.</p></section><section>
<h3><a name="Configuration_keys"></a>Configuration keys</h3>
<p>By default, all built-in erasure coding policies are disabled, except the one defined in <code>dfs.namenode.ec.system.default.policy</code> which is enabled by default. The cluster administrator can enable set of policies through <code>hdfs ec [-enablePolicy -policy &lt;policyName&gt;]</code> command based on the size of the cluster and the desired fault-tolerance properties. For instance, for a cluster with 9 racks, a policy like <code>RS-10-4-1024k</code> will not preserve rack-level fault-tolerance, and <code>RS-6-3-1024k</code> or <code>RS-3-2-1024k</code> might be more appropriate. If the administrator only cares about node-level fault-tolerance, <code>RS-10-4-1024k</code> would still be appropriate as long as there are at least 14 DataNodes in the cluster.</p>
<p>A system default EC policy can be configured via &#x2018;dfs.namenode.ec.system.default.policy&#x2019; configuration. With this configuration, the default EC policy will be used when no policy name is passed as an argument in the &#x2018;-setPolicy&#x2019; command.</p>
<p>By default, the &#x2018;dfs.namenode.ec.system.default.policy&#x2019; is &#x201c;RS-6-3-1024k&#x201d;.</p>
<p>The codec implementations for Reed-Solomon and XOR can be configured with the following client and DataNode configuration keys: <code>io.erasurecode.codec.rs.rawcoders</code> for the default RS codec, <code>io.erasurecode.codec.rs-legacy.rawcoders</code> for the legacy RS codec, <code>io.erasurecode.codec.xor.rawcoders</code> for the XOR codec. User can also configure self-defined codec with configuration key like: <code>io.erasurecode.codec.self-defined-codec.rawcoders</code>. The values for these key are lists of coder names with a fall-back mechanism. These codec factories are loaded in the order specified by the configuration values, until a codec is loaded successfully. The default RS and XOR codec configuration prefers native implementation over the pure Java one. There is no RS-LEGACY native codec implementation so the default is pure Java implementation only. All these codecs have implementations in pure Java. For default RS codec, there is also a native implementation which leverages Intel ISA-L library to improve the performance of codec. For XOR codec, a native implementation which leverages Intel ISA-L library to improve the performance of codec is also supported. Please refer to section &#x201c;Enable Intel ISA-L&#x201d; for more detail information. The default implementation for RS Legacy is pure Java, and the default implementations for default RS and XOR are native implementations using Intel ISA-L library.</p>
<p>Erasure coding background recovery work on the DataNodes can also be tuned via the following configuration parameters:</p>
<ol style="list-style-type: decimal">
<li><code>dfs.datanode.ec.reconstruction.stripedread.timeout.millis</code> - Timeout for striped reads. Default value is 5000 ms.</li>
<li><code>dfs.datanode.ec.reconstruction.stripedread.buffer.size</code> - Buffer size for reader service. Default value is 64KB.</li>
<li><code>dfs.datanode.ec.reconstruction.threads</code> - Number of threads used by the Datanode for background reconstruction work. Default value is 8 threads.</li>
<li><code>dfs.datanode.ec.reconstruction.xmits.weight</code> - Relative weight of xmits used by EC background recovery task comparing to replicated block recovery. Default value is 0.5. It sets to <code>0</code> to disable calculate weights for EC recovery tasks, that is, EC task always has <code>1</code> xmits. The xmits of an erasure coding recovery task is calculated as the maximum value between the number of read streams and the number of write streams. For example, if an EC recovery task need to read from 6 nodes and write to 2 nodes, it has xmits of <code>max(6, 2) * 0.5 = 3</code>. Recovery task for replicated file always counts as <code>1</code> xmit. NameNode utilizes <code>dfs.namenode.replication.max-streams</code> minus the total <code>xmitsInProgress</code> on the DataNode that combines of the xmits from replicated file and EC files, to schedule recovery tasks to this DataNode.</li>
</ol></section><section>
<h3><a name="Enable_Intel_ISA-L"></a>Enable Intel ISA-L</h3>
<p>HDFS native implementation of default RS codec leverages Intel ISA-L library to improve the encoding and decoding calculation. To enable and use Intel ISA-L, there are three steps.</p>
<ol style="list-style-type: decimal">
<li>Build ISA-L library. Please refer to the official site &#x201c;<a class="externalLink" href="https://github.com/01org/isa-l/">https://github.com/01org/isa-l/</a>&#x201d; for detail information.</li>
<li>Build Hadoop with ISA-L support. Please refer to &#x201c;Intel ISA-L build options&#x201d; section in &#x201c;Build instructions for Hadoop&#x201d; in (BUILDING.txt) in the source code.</li>
<li>Use <code>-Dbundle.isal</code> to copy the contents of the <code>isal.lib</code> directory into the final tar file. Deploy Hadoop with the tar file. Make sure ISA-L is available on HDFS clients and DataNodes.</li>
</ol>
<p>To verify that ISA-L is correctly detected by Hadoop, run the <code>hadoop checknative</code> command.</p></section><section>
<h3><a name="Administrative_commands"></a>Administrative commands</h3>
<p>HDFS provides an <code>ec</code> subcommand to perform administrative commands related to erasure coding.</p>
<div class="source">
<div class="source">
<pre> hdfs ec [generic options]
[-setPolicy -path &lt;path&gt; [-policy &lt;policyName&gt;] [-replicate]]
[-getPolicy -path &lt;path&gt;]
[-unsetPolicy -path &lt;path&gt;]
[-listPolicies]
[-addPolicies -policyFile &lt;file&gt;]
[-listCodecs]
[-enablePolicy -policy &lt;policyName&gt;]
[-disablePolicy -policy &lt;policyName&gt;]
[-removePolicy -policy &lt;policyName&gt;]
[-verifyClusterSetup -policy &lt;policyName&gt;...&lt;policyName&gt;]
[-help [cmd ...]]
</pre></div></div>
<p>Below are the details about each command.</p>
<ul>
<li>
<p><code>[-setPolicy -path &lt;path&gt; [-policy &lt;policyName&gt;] [-replicate]]</code></p>
<p>Sets an erasure coding policy on a directory at the specified path.</p>
<p><code>path</code>: An directory in HDFS. This is a mandatory parameter. Setting a policy only affects newly created files, and does not affect existing files.</p>
<p><code>policyName</code>: The erasure coding policy to be used for files under this directory. This parameter can be omitted if a &#x2018;dfs.namenode.ec.system.default.policy&#x2019; configuration is set. The EC policy of the path will be set with the default value in configuration.</p>
<p><code>-replicate</code> apply the default <code>REPLICATION</code> scheme on the directory, force the directory to adopt 3x replication scheme.</p>
<p><code>-replicate</code> and <code>-policy &lt;policyName&gt;</code> are optional arguments. They cannot be specified at the same time.</p>
</li>
<li>
<p><code>[-getPolicy -path &lt;path&gt;]</code></p>
<p>Get details of the erasure coding policy of a file or directory at the specified path.</p>
</li>
<li>
<p><code>[-unsetPolicy -path &lt;path&gt;]</code></p>
<p>Unset an erasure coding policy set by a previous call to <code>setPolicy</code> on a directory. If the directory inherits the erasure coding policy from an ancestor directory, <code>unsetPolicy</code> is a no-op. Unsetting the policy on a directory which doesn&#x2019;t have an explicit policy set will not return an error.</p>
</li>
<li>
<p><code>[-listPolicies]</code></p>
<p>Lists all (enabled, disabled and removed) erasure coding policies registered in HDFS. Only the enabled policies are suitable for use with the <code>setPolicy</code> command.</p>
</li>
<li>
<p><code>[-addPolicies -policyFile &lt;file&gt;]</code></p>
<p>Add a list of user defined erasure coding policies. Please refer etc/hadoop/user_ec_policies.xml.template for the example policy file. The maximum cell size is defined in property &#x2018;dfs.namenode.ec.policies.max.cellsize&#x2019; with the default value 4MB. Currently HDFS allows the user to add 64 policies in total, and the added policy ID is in range of 64 to 127. Adding policy will fail if there are already 64 policies added.</p>
</li>
<li>
<p><code>[-listCodecs]</code></p>
<p>Get the list of supported erasure coding codecs and coders in system. A coder is an implementation of a codec. A codec can have different implementations, thus different coders. The coders for a codec are listed in a fall back order.</p>
</li>
<li>
<p><code>[-removePolicy -policy &lt;policyName&gt;]</code></p>
<p>Remove an user defined erasure coding policy.</p>
</li>
<li>
<p><code>[-enablePolicy -policy &lt;policyName&gt;]</code></p>
<p>Enable an erasure coding policy.</p>
</li>
<li>
<p><code>[-disablePolicy -policy &lt;policyName&gt;]</code></p>
<p>Disable an erasure coding policy.</p>
</li>
<li>
<p><code>[-verifyClusterSetup -policy &lt;policyName&gt;...&lt;policyName&gt;]</code></p>
<p>Verify if the cluster setup can support all enabled erasure coding policies. If optional parameter -policy is specified, verify if the cluster setup can support the given policy or policies.</p>
</li>
</ul></section></section><section>
<h2><a name="Limitations"></a>Limitations</h2>
<p>Certain HDFS operations, i.e., <code>hflush</code>, <code>hsync</code>, <code>concat</code>, <code>setReplication</code>, <code>truncate</code> and <code>append</code>, are not supported on erasure coded files due to substantial technical challenges.</p>
<ul>
<li><code>append()</code> and <code>truncate()</code> on an erasure coded file will throw <code>IOException</code>. However, with <code>NEW_BLOCK</code> flag enabled, append to a closed striped file is supported.</li>
<li><code>concat()</code> will throw <code>IOException</code> if files are mixed with different erasure coding policies or with replicated files.</li>
<li><code>setReplication()</code> is no-op on erasure coded files.</li>
<li><code>hflush()</code> and <code>hsync()</code> on <code>DFSStripedOutputStream</code> are no-op. Thus calling <code>hflush()</code> or <code>hsync()</code> on an erasure coded file can not guarantee data being persistent.</li>
</ul>
<p>A client can use <a href="../hadoop-common/filesystem/filesystem.html#interface_StreamCapabilities"><code>StreamCapabilities</code></a> API to query whether a <code>OutputStream</code> supports <code>hflush()</code> and <code>hsync()</code>. If the client desires data persistence via <code>hflush()</code> and <code>hsync()</code>, the current remedy is creating such files as regular 3x replication files in a non-erasure-coded directory, or using <code>FSDataOutputStreamBuilder#replicate()</code> API to create 3x replication files in an erasure-coded directory.</p></section>
</div>
</div>
<div class="clear">
<hr/>
</div>
<div id="footer">
<div class="xright">
&#169; 2008-2023
Apache Software Foundation
- <a href="http://maven.apache.org/privacy-policy.html">Privacy Policy</a>.
Apache Maven, Maven, Apache, the Apache feather logo, and the Apache Maven project logos are trademarks of The Apache Software Foundation.
</div>
<div class="clear">
<hr/>
</div>
</div>
</body>
</html>