HADOOP-14742. Document multi-URI replication Inode for ViewFS. Contributed by Gera Shegalov
(cherry picked from commit ddb67ca707
)
This commit is contained in:
parent
2bda1ffe72
commit
c9364b3bce
|
@ -180,6 +180,145 @@ Recall that one cannot rename files or directories across namenodes or clusters
|
||||||
|
|
||||||
This will NOT work in the new world if `/user` and `/data` are actually stored on different namenodes within a cluster.
|
This will NOT work in the new world if `/user` and `/data` are actually stored on different namenodes within a cluster.
|
||||||
|
|
||||||
|
Multi-Filesystem I/0 with Nfly Mount Points
|
||||||
|
-----------------
|
||||||
|
|
||||||
|
HDFS and other distributed filesystems provide data resilience via some sort of
|
||||||
|
redundancy such as block replication or more sophisticated distributed encoding.
|
||||||
|
However, modern setups may be comprised of multiple Hadoop clusters, enterprise
|
||||||
|
filers, hosted on and off premise. Nfly mount points make it possible for a
|
||||||
|
single logical file to be synchronously replicated by multiple filesystems.
|
||||||
|
It's designed for a relatively small files up to a gigabyte. In general it's a
|
||||||
|
function of a single core/single network link performance since the logic
|
||||||
|
resides in a single client JVM using ViewFs such as FsShell or a
|
||||||
|
MapReduce task.
|
||||||
|
|
||||||
|
### Basic Configuration
|
||||||
|
|
||||||
|
Consider the following example to understand the basic configuration of Nfly.
|
||||||
|
Suppose we want to keep the directory `ads` replicated on three filesystems
|
||||||
|
represented by URIs: `uri1`, `uri2` and `uri3`.
|
||||||
|
|
||||||
|
```xml
|
||||||
|
<property>
|
||||||
|
<name>fs.viewfs.mounttable.global.linkNfly../ads</name>
|
||||||
|
<value>uri1,uri2,uri3</value>
|
||||||
|
</property>
|
||||||
|
```
|
||||||
|
Note 2 consecutive `..` in the property name. They arise because of empty
|
||||||
|
settings for advanced tweaking of the mount point which we will show in
|
||||||
|
subsequent sections. The property value is a comma-separated list of URIs.
|
||||||
|
|
||||||
|
URIs may point to different clusters in different regions
|
||||||
|
`hdfs://datacenter-east/ads`, `s3a://models-us-west/ads`, `hdfs://datacenter-west/ads`
|
||||||
|
or in the simplest case to different directories under the same filesystem,
|
||||||
|
e.g., `file:/tmp/ads1`, `file:/tmp/ads2`, `file:/tmp/ads3`
|
||||||
|
|
||||||
|
All *modifications* performed under the global path `viewfs://global/ads` are
|
||||||
|
propagated to all destination URIs if the underlying system is available.
|
||||||
|
|
||||||
|
For instance if we create a file via hadoop shell
|
||||||
|
```bash
|
||||||
|
hadoop fs -touchz viewfs://global/ads/z1
|
||||||
|
```
|
||||||
|
|
||||||
|
We will find it via local filesystem in the latter configuration
|
||||||
|
```bash
|
||||||
|
ls -al /tmp/ads*/z1
|
||||||
|
-rw-r--r-- 1 user wheel 0 Mar 11 12:17 /tmp/ads1/z1
|
||||||
|
-rw-r--r-- 1 user wheel 0 Mar 11 12:17 /tmp/ads2/z1
|
||||||
|
-rw-r--r-- 1 user wheel 0 Mar 11 12:17 /tmp/ads3/z1
|
||||||
|
```
|
||||||
|
|
||||||
|
A read from the global path is processed by the first filesystem that does not
|
||||||
|
result in an exception. The order in which filesystems are accessed depends on
|
||||||
|
whether they are available at this moment or and whether a topological order
|
||||||
|
exists.
|
||||||
|
|
||||||
|
### Advanced Configuration
|
||||||
|
|
||||||
|
Mount points `linkNfly` can be further configured using parameters passed as a
|
||||||
|
comma-separated list of key=value pairs. Following parameters are currently
|
||||||
|
supported.
|
||||||
|
|
||||||
|
`minReplication=int` determines the minimum number of destinations that have to
|
||||||
|
process a write modification without exceptions, if below nfly write is failed.
|
||||||
|
It is an configuration error to have minReplication higher than the number of
|
||||||
|
target URIs. The default is 2.
|
||||||
|
|
||||||
|
If minReplication is lower than the number of target URIs we may have some
|
||||||
|
target URIs without latest writes. It can be compensated by employing more
|
||||||
|
expensive read operations controlled by the following settings
|
||||||
|
|
||||||
|
`readMostRecent=boolean` if set to `true` causes Nfly client to check the path
|
||||||
|
under all target URIs instead of just the first one based on the topology order.
|
||||||
|
Among all available at the moment the one with the most recent modification time
|
||||||
|
is processed.
|
||||||
|
|
||||||
|
`repairOnRead=boolean` if set to `true` causes Nfly to copy most recent replica
|
||||||
|
to stale targets such that subsequent reads can be done cheaply again from the
|
||||||
|
closest replica.
|
||||||
|
|
||||||
|
### Network Topology
|
||||||
|
|
||||||
|
Nfly seeks to satisfy reads from the "closest" target URI.
|
||||||
|
|
||||||
|
To this end, Nfly extends the notion of
|
||||||
|
<a href="hadoop-project-dist/hadoop-common/RackAwareness.html">Rack Awareness</a>
|
||||||
|
to the authorities of target URIs.
|
||||||
|
|
||||||
|
Nfly applies NetworkTopology to resolve authorities of the URIs. Most commonly
|
||||||
|
a script based mapping is used in a heterogeneous setup. We could have a script
|
||||||
|
providing the following topology mapping
|
||||||
|
|
||||||
|
| URI | Topology |
|
||||||
|
|-------------------------------|------------------------- |
|
||||||
|
| `hdfs://datacenter-east/ads` | /us-east/onpremise-hdfs |
|
||||||
|
| `s3a://models-us-west/ads` | /us-west/aws |
|
||||||
|
| `hdfs://datacenter-west/ads` | /us-west/onpremise-hdfs |
|
||||||
|
|
||||||
|
|
||||||
|
If a target URI does not have the authority part as in `file:/` Nfly injects
|
||||||
|
client's local node name.
|
||||||
|
|
||||||
|
### Example Nfly Configuration
|
||||||
|
|
||||||
|
```xml
|
||||||
|
<property>
|
||||||
|
<name>fs.viewfs.mounttable.global.linkNfly.minReplication=3,readMostRecent=true,repairOnRead=false./ads</name>
|
||||||
|
<value>hdfs://datacenter-east/ads,hdfs://datacenter-west/ads,s3a://models-us-west/ads,file:/tmp/ads</value>
|
||||||
|
</property>
|
||||||
|
```
|
||||||
|
|
||||||
|
### How Nfly File Creation works
|
||||||
|
|
||||||
|
```java
|
||||||
|
FileSystem fs = FileSystem.get("viewfs://global/", ...);
|
||||||
|
FSDataOutputStream out = fs.create("viewfs://global/ads/f1");
|
||||||
|
out.write(...);
|
||||||
|
out.close();
|
||||||
|
```
|
||||||
|
The code above would result in the following execution.
|
||||||
|
|
||||||
|
1. create an invisible file `_nfly_tmp_f1` under each target URI i.e.,
|
||||||
|
`hdfs://datacenter-east/ads/_nfly_tmp_f1`, `hdfs://datacenter-west/ads/_nfly_tmp_f1`, etc.
|
||||||
|
This is done by calling `create` on underlying filesystems and returns a
|
||||||
|
`FSDataOutputStream` object `out` that wraps all four output streams.
|
||||||
|
|
||||||
|
2. Thus each subsequent write on `out` can be forwarded to each wrapped stream.
|
||||||
|
|
||||||
|
3. On `out.close` all streams are closed, and the files are renamed from `_nfly_tmp_f1` to `f1`.
|
||||||
|
All files receive the same *modification time* corresponding to the client
|
||||||
|
system time as of beginning of this step.
|
||||||
|
|
||||||
|
4. If at least `minReplication` destinations have gone through steps 1-3 without
|
||||||
|
failures Nfly considers the transaction logically committed; Otherwise it tries
|
||||||
|
to clean up the temporary files in a best-effort attempt.
|
||||||
|
|
||||||
|
Note that because 4 is a best-effort step and the client JVM could crash and never
|
||||||
|
resume its work, it's a good idea to provision some sort of cron job to purge such
|
||||||
|
`_nfly_tmp` files.
|
||||||
|
|
||||||
### FAQ
|
### FAQ
|
||||||
|
|
||||||
1. **As I move from non-federated world to the federated world, I will have to keep track of namenodes for different volumes; how do I do that?**
|
1. **As I move from non-federated world to the federated world, I will have to keep track of namenodes for different volumes; how do I do that?**
|
||||||
|
|
Loading…
Reference in New Issue