HDFS-9088. Cleanup erasure coding documentation. Contributed by Andrew Wang.

Change-Id: Ic3ec1f29fef0e27c46fff66fd28a51f8c4c61e71
This commit is contained in:
Zhe Zhang 2015-09-17 09:56:32 -07:00
parent ced438a4bf
commit e36129b61a
2 changed files with 57 additions and 68 deletions

View File

@ -427,3 +427,5 @@
HDFS-8899. Erasure Coding: use threadpool for EC recovery tasks on DataNode.
(Rakesh R via zhz)
HDFS-9088. Cleanup erasure coding documentation. (wang via zhz)

View File

@ -19,108 +19,95 @@ HDFS Erasure Coding
* [Purpose](#Purpose)
* [Background](#Background)
* [Architecture](#Architecture)
* [Hardware resources](#Hardware_resources)
* [Deployment](#Deployment)
* [Configuration details](#Configuration_details)
* [Deployment details](#Deployment_details)
* [Cluster and hardware configuration](#Cluster_and_hardware_configuration)
* [Configuration keys](#Configuration_keys)
* [Administrative commands](#Administrative_commands)
Purpose
-------
Replication is expensive -- the default 3x replication scheme has 200% overhead in storage space and other resources (e.g., network bandwidth).
However, for “warm” and “cold” datasets with relatively low I/O activities, secondary block replicas are rarely accessed during normal operations, but still consume the same amount of resources as the primary ones.
Replication is expensive -- the default 3x replication scheme in HDFS has 200% overhead in storage space and other resources (e.g., network bandwidth).
However, for warm and cold datasets with relatively low I/O activities, additional block replicas are rarely accessed during normal operations, but still consume the same amount of resources as the first replica.
Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication, which provides the same level of fault tolerance with much less storage space. In typical Erasure Coding(EC) setups, the storage overhead is ≤ 50%.
Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication, which provides the same level of fault-tolerance with much less storage space. In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%.
Background
----------
In storage systems, the most notable usage of EC is Redundant Array of Inexpensive Disks (RAID). RAID implements EC through striping, which divides logically sequential data (such as a file) into smaller units (such as bit, byte, or block) and stores consecutive units on different disks. In the rest of this guide this unit of striping distribution is termed a striping cell (or cell). For each stripe of original data cells, a certain number of parity cells are calculated and stored -- the process of which is called encoding. The error on any striping cell can be recovered through decoding calculation based on surviving data and parity cells.
Integrating the EC function with HDFS could get storage efficient deployments. It can provide similar data tolerance as traditional HDFS replication based deployments but it stores only one original replica data and parity cells.
In a typical case, A file with 6 blocks will actually be consume space of 6*3 = 18 blocks with replication factor 3. But with EC (6 data,3 parity) deployment, it will only consume space of 9 blocks.
Integrating EC with HDFS can improve storage efficiency while still providing similar data durability as traditional replication-based HDFS deployments.
As an example, a 3x replicated file with 6 blocks will consume 6*3 = 18 blocks of disk space. But with EC (6 data, 3 parity) deployment, it will only consume 9 blocks of disk space.
Architecture
------------
In the context of EC, striping has several critical advantages. First, it enables online EC which bypasses the conversion phase and immediately saves storage space. Online EC also enhances sequential I/O performance by leveraging multiple disk spindles in parallel; this is especially desirable in clusters with high end networking . Second, it naturally distributes a small file to multiple DataNodes and eliminates the need to bundle multiple files into a single coding group. This greatly simplifies file operations such as deletion, quota reporting, and migration between federated namespaces.
In the context of EC, striping has several critical advantages. First, it enables online EC (writing data immediately in EC format), avoiding a conversion phase and immediately saving storage space. Online EC also enhances sequential I/O performance by leveraging multiple disk spindles in parallel; this is especially desirable in clusters with high end networking. Second, it naturally distributes a small file to multiple DataNodes and eliminates the need to bundle multiple files into a single coding group. This greatly simplifies file operations such as deletion, quota reporting, and migration between federated namespaces.
As in general HDFS clusters, small files could account for over 3/4 of total storage consumption. So, In this first phase of erasure coding work, HDFS supports striping model. In the near future, HDFS will supports contiguous layout as second second phase work. So this guide focuses more on striping model EC.
In typical HDFS clusters, small files can account for over 3/4 of total storage consumption. To better support small files, in this first phase of work HDFS supports EC with striping. In the future, HDFS will also support a contiguous EC layout. See the design doc and discussion on [HDFS-7285](https://issues.apache.org/jira/browse/HDFS-7285) for more information.
* **NameNode Extensions** - Under the striping layout, a HDFS file is logically composed of block groups, each of which contains a certain number of internal blocks.
To eliminate the need for NameNode to monitor all internal blocks, a new hierarchical block naming protocol is introduced, where the ID of a block group can be inferred from any of its internal blocks. This allows each block group to be managed as a new type of BlockInfo named BlockInfoStriped, which tracks its own internal blocks by attaching an index to each replica location.
* **NameNode Extensions** - Striped HDFS files are logically composed of block groups, each of which contains a certain number of internal blocks.
To reduce NameNode memory consumption from these additional blocks, a new hierarchical block naming protocol was introduced. The ID of a block group can be inferred from the ID of any of its internal blocks. This allows management at the level of the block group rather than the block.
* **Client Extensions** - The basic principle behind the extensions is to allow the client node to work on multiple internal blocks in a block group in
parallel.
* **Client Extensions** - The client read and write paths were enhanced to work on multiple internal blocks in a block group in parallel.
On the output / write path, DFSStripedOutputStream manages a set of data streamers, one for each DataNode storing an internal block in the current block group. The streamers mostly
work asynchronously. A coordinator takes charge of operations on the entire block group, including ending the current block group, allocating a new block group, and so forth.
On the input / read path, DFSStripedInputStream translates a requested logical byte range of data as ranges into internal blocks stored on DataNodes. It then issues read requests in
parallel. Upon failures, it issues additional read requests for decoding.
* **DataNode Extensions** - ErasureCodingWorker(ECWorker) is for reconstructing erased erasure coding blocks and runs along with the Datanode process. Erased block details would have been found out by Namenode ReplicationMonitor thread and sent to Datanode via its heartbeat responses as discussed in the previous sections. For each reconstruction task,
i.e. ReconstructAndTransferBlock, it will start an internal daemon thread that performs 3 key tasks:
* **DataNode Extensions** - The DataNode runs an additional ErasureCodingWorker (ECWorker) task for background recovery of failed erasure coded blocks. Failed EC blocks are detected by the NameNode, which then chooses a DataNode to do the recovery work. The recovery task is passed as a heartbeat response. This process is similar to how replicated blocks are re-replicated on failure. Reconstruction performs three key tasks:
_1.Read the data from source nodes:_ For reading the data blocks from different source nodes, it uses a dedicated thread pool.
The thread pool is initialized when ErasureCodingWorker initializes. Based on the EC policy, it schedules the read requests to all source targets and ensures only to read
minimum required input blocks for reconstruction.
1. _Read the data from source nodes:_ Input data is read in parallel from source nodes using a dedicated thread pool.
Based on the EC policy, it schedules the read requests to all source targets and reads only the minimum number of input blocks for reconstruction.
_2.Decode the data and generate the output data:_ Actual decoding/encoding is done by using RawErasureEncoder API currently.
All the erased data and/or parity blocks will be recovered together.
1. _Decode the data and generate the output data:_ New data and parity blocks are decoded from the input data. All missing data and parity blocks are decoded together.
_3.Transfer the generated data blocks to target nodes:_ Once decoding is finished, it will encapsulate the output data to packets and send them to
target Datanodes.
To accommodate heterogeneous workloads, we allow files and directories in an HDFS cluster to have different replication and EC policies.
* **ErasureCodingPolicy**
1. _Transfer the generated data blocks to target nodes:_ Once decoding is finished, the recovered blocks are transferred to target DataNodes.
* **ErasureCoding policy**
To accommodate heterogeneous workloads, we allow files and directories in an HDFS cluster to have different replication and EC policies.
Information on how to encode/decode a file is encapsulated in an ErasureCodingPolicy class. Each policy is defined by the following 2 pieces of information:
_1.The ECScema: This includes the numbers of data and parity blocks in an EC group (e.g., 6+3), as well as the codec algorithm (e.g., Reed-Solomon).
_2.The size of a striping cell.
1. _The ECSchema:_ This includes the numbers of data and parity blocks in an EC group (e.g., 6+3), as well as the codec algorithm (e.g., Reed-Solomon).
Client and Datanode uses EC codec framework directly for doing the endoing/decoding work.
1. _The size of a striping cell._ This determines the granularity of striped reads and writes, including buffer sizes and encoding work.
* **Erasure Codec Framework**
We support a generic EC framework which allows system users to define, configure, and deploy multiple coding schemas such as conventional Reed-Solomon, HitchHicker and
so forth.
ErasureCoder is provided to encode or decode for a block group in the middle level, and RawErasureCoder is provided to perform the concrete algorithm calculation in the low level. ErasureCoder can
combine and make use of different RawErasureCoders for tradeoff. We abstracted coder type, data blocks size, parity blocks size into ECSchema. A default system schema using RS (6, 3) is built-in.
For the system default codec Reed-Solomon we implemented both RSRawErasureCoder in pure Java and NativeRawErasureCoder based on Intel ISA-L. Below is the performance
comparing for different coding chunk size. We can see that the native coder can outperform the Java coder by up to 35X.
Currently, HDFS supports the Reed-Solomon and XOR erasure coding algorithms. Additional algorithms are planned as future work.
The system default scheme is Reed-Solomon (6, 3) with a cell size of 64KB.
_Intel® Storage Acceleration-Library(Intel® ISA-L)_ ISA-L is an Open Source Version and is a collection of low-level functions used in storage applications.
The open source version contains fast erasure codes that implement a general Reed-Solomon type encoding for blocks of data that helps protect against
erasure of whole blocks. The general ISA-L library contains an expanded set of functions used for data protection, hashing, encryption, etc. By
leveraging instruction sets like SSE, AVX and AVX2, the erasure coding functions are much optimized and outperform greatly on IA platforms. ISA-L
supports Linux, Windows and other platforms as well. Additionally, it also supports incremental coding so applications dont have to wait all source
blocks to be available before to perform the coding, which can be used in HDFS.
Hardware resources
------------------
For using EC feature, you need to prepare for the following.
Depending on the ECSchemas used, we need to have minimum number of Datanodes available in the cluster. Example if we use ReedSolomon(6, 3) ECSchema,
then minimum nodes required is 9 to succeed the write. It can tolerate up to 3 failures.
Deployment
----------
### Configuration details
### Cluster and hardware configuration
In the EC feature, raw coders are configurable. So, users need to decide the RawCoder algorithms.
Configure the customized algorithms with configuration key "*io.erasurecode.codecs*".
Erasure coding places additional demands on the cluster in terms of CPU and network.
Default Reed-Solomon based raw coders available in built, which can be configured by using the configuration key "*io.erasurecode.codec.rs.rawcoder*".
And also another default raw coder available if XOR based raw coder. Which could be configured by using "*io.erasurecode.codec.xor.rawcoder*"
Encoding and decoding work consumes additional CPU on both HDFS clients and DataNodes.
_EarasureCodingWorker Confugurations:_
dfs.datanode.stripedread.threshold.millis - Threshold time for polling timeout for read service. Default value is 5000
dfs.datanode.stripedread.threads Number striped read thread pool threads. Default value is 20
dfs.datanode.stripedread.buffer.size - Buffer size for reader service. Default value is 256 * 1024
Erasure coded files are also spread across racks for rack fault-tolerance.
This means that when reading and writing striped files, most operations are off-rack.
Network bisection bandwidth is thus very important.
### Deployment details
For rack fault-tolerance, it is also important to have at least as many racks as the configured EC stripe width.
For the default EC policy of RS (6,3), this means minimally 9 racks, and ideally 10 or 11 to handle planned and unplanned outages.
For clusters with fewer racks than the stripe width, HDFS cannot maintain rack fault-tolerance, but will still attempt
to spread a striped file across multiple nodes to preserve node-level fault-tolerance.
With the striping model, client machine is responsible for do the EC endoing and tranferring data to the datanodes.
So, EC with striping model expects client machines with hghg end configurations especially of CPU and network.
### Configuration keys
The codec implementation for Reed-Solomon and XOR can be configured with the following client and DataNode configuration keys:
`io.erasurecode.codec.rs.rawcoder` and `io.erasurecode.codec.xor.rawcoder`.
The default implementations for both of these codecs are pure Java.
Erasure coding background recovery work on the DataNodes can also be tuned via the following configuration parameters:
1. `dfs.datanode.stripedread.threshold.millis` - Timeout for striped reads. Default value is 5000 ms.
1. `dfs.datanode.stripedread.threads` - Number of concurrent reader threads. Default value is 20 threads.
1. `dfs.datanode.stripedread.buffer.size` - Buffer size for reader service. Default value is 256KB.
### Administrative commands
ErasureCoding command-line is provided to perform administrative commands related to ErasureCoding. This can be accessed by executing the following command.
HDFS provides an `erasurecode` subcommand to perform administrative commands related to erasure coding.
hdfs erasurecode [generic options]
[-setPolicy [-s <policyName>] <path>]
@ -131,18 +118,18 @@ Deployment
Below are the details about each command.
* **SetPolicy command**: `[-setPolicy [-s <policyName>] <path>]`
* `[-setPolicy [-s <policyName>] <path>]`
SetPolicy command is used to set an ErasureCoding policy on a directory at the specified path.
Sets an ErasureCoding policy on a directory at the specified path.
`path`: Refer to a pre-created directory in HDFS. This is a mandatory parameter.
`path`: An directory in HDFS. This is a mandatory parameter. Setting a policy only affects newly created files, and does not affect existing files.
`policyName`: This is an optional parameter, specified using -s flag. Refer to the name of ErasureCodingPolicy to be used for encoding files under this directory. If not specified the system default ErasureCodingPolicy will be used.
`policyName`: The ErasureCoding policy to be used for files under this directory. This is an optional parameter, specified using -s flag. If no policy is specified, the system default ErasureCodingPolicy will be used.
* **GetPolicy command**: `[-getPolicy <path>]`
* `[-getPolicy <path>]`
GetPolicy command is used to get details of the ErasureCoding policy of a file or directory at the specified path.
Get details of the ErasureCoding policy of a file or directory at the specified path.
* **ListPolicies command**: `[-listPolicies]`
* `[-listPolicies]`
Lists all supported ErasureCoding policies. For setPolicy command, one of these policies' name should be provided.
Lists all supported ErasureCoding policies. These names are suitable for use with the `setPolicy` command.