HADOOP-17480. Document that AWS S3 is consistent and that S3Guard is not needed (#2636)
Contributed by Steve Loughran.

Change-Id: I775e3ee7b60665240ec621859c337b053f747a49
@@ -20,6 +20,27 @@ This document covers the architecture and implementation details of the S3A committers.

For information on using the committers, see [the S3A Committers](./committer.html).

### January 2021 Update

Now that S3 is fully consistent, problems related to inconsistent
directory listings have gone. However, the rename problem remains: committing
work by renaming directories is unsafe as well as horribly slow.

This architecture document, and the committers, were written at a time
when S3 was inconsistent. The two committers addressed this problem differently:

* Staging Committer: relied on a cluster HDFS filesystem for safely propagating
the lists of files to commit from workers to the job manager/driver.
* Magic Committer: required S3Guard to offer consistent directory listings
on the object store.

With consistent S3, the Magic Committer can be safely used with any S3 bucket.
The choice of which to use, then, is a matter for experimentation.

This architecture document was written in 2017, a time when S3 was only
consistent when an extra consistency layer such as S3Guard was used.
The document indicates where requirements/constraints which existed then
are now obsolete.
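For experimentation, both committers are selected through standard S3A configuration; the fragment below is a minimal, illustrative `core-site.xml` sketch (the option names are the stock S3A ones; the values are not a recommendation):

```xml
<!-- Sketch: select the Magic committer, usable with any consistent S3 bucket. -->
<property>
  <name>fs.s3a.committer.name</name>
  <value>magic</value>
</property>
<property>
  <name>fs.s3a.committer.magic.enabled</name>
  <value>true</value>
</property>
```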

## Problem: Efficient, reliable commits of work to consistent S3 buckets

@@ -49,10 +70,10 @@ can be executed server-side, but as it does not complete until the in-cluster

copy has completed, it takes time proportional to the amount of data.

The rename overhead is the most visible issue, but it is not the most dangerous.
That is the fact that until late 2020, path listings had no consistency guarantees,
and may have lagged the addition or deletion of files.
If files were not listed, the commit operation would *not* copy them, and
so they would not appear in the final output.

The solution to this problem is closely coupled to the S3 protocol itself:
delayed completion of multi-part PUT operations
@@ -828,6 +849,8 @@ commit sequence in `Task.done()`, when `talkToAMTGetPermissionToCommit()`

# Requirements of an S3A Committer

The design requirements of the S3A committer were:

1. Support an eventually consistent S3 object store as a reliable direct
destination of work through the S3A filesystem client.
1. Efficient: implies no rename, and a minimal amount of delay in the job driver's
@@ -841,6 +864,7 @@ the job, and any previous incompleted jobs.

1. Security: not to permit privilege escalation from other users with
write access to the same file system(s).


## Features of S3 and the S3A Client

@@ -852,8 +876,8 @@ MR committer algorithms have significant performance problems.

1. Single-object renames are implemented as a copy and delete sequence.
1. COPY is atomic, but overwrites cannot be prevented.
1. [Obsolete] Amazon S3 is eventually consistent on listings, deletes and updates.
1. [Obsolete] Amazon S3 has create consistency; however, the negative response of a HEAD/GET
performed on a path before an object was created can be cached, unintentionally
creating a create inconsistency. The S3A client library does perform such a check
on `create()` and `rename()` to determine the state of the destination path, and
@@ -872,11 +896,12 @@ data, with the `S3ABlockOutputStream` of HADOOP-13560 uploading written data

as parts of a multipart PUT once the threshold set in the configuration
parameter `fs.s3a.multipart.size` (default: 100MB) is reached.

[S3Guard](./s3guard.html) added an option of a consistent view of the filesystem
to all processes using the shared DynamoDB table as the authoritative store of
metadata.
The proposed algorithm was designed to work with consistent object stores without the
need for any DynamoDB tables. Since AWS S3 became consistent in 2020, the
committers now work directly with the store.
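For reference, the threshold mentioned above is an ordinary S3A option; a sketch with the default value quoted in this document (illustrative, not tuning advice):

```xml
<!-- Sketch: writes are uploaded as multipart parts once this threshold is reached. -->
<property>
  <name>fs.s3a.multipart.size</name>
  <value>100M</value>
</property>
```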

## Related work: Spark's `DirectOutputCommitter`

@@ -1246,8 +1271,8 @@ for parallel committing of work, including all the error handling based on

the Netflix experience.

It differs in that it directly streams data to S3 (there is no staging),
and it stores the lists of pending commits in S3 as well. It
requires a consistent S3 store.


### Core concept: A new/modified output stream for delayed PUT commits

@@ -1480,7 +1505,7 @@ The time to commit a job will be `O(files/threads)`

Every `.pendingset` file in the job attempt directory must be loaded, and a PUT
request issued for every incomplete upload listed in the files.

[Obsolete] Note that it is the bulk listing of all children which is where full consistency
is required. If instead, the list of files to commit could be returned from
tasks to the job committer, as the Spark commit protocol allows, it would be
possible to commit data to an inconsistent object store.
@@ -1525,7 +1550,7 @@ commit algorithms.

1. It is possible to create more than one client writing to the
same destination file within the same S3A client/task, either sequentially or in parallel.

1. [Obsolete] Even with a consistent metadata store, if a job overwrites existing
files, then old data may still be visible to clients reading the data, until
the update has propagated to all replicas of the data.
@@ -1538,7 +1563,7 @@ all files in the destination directory which were not being overwritten.

for any purpose other than for the storage of pending commit data.

1. Unless extra code is added to every FS operation, it will still be possible
to manipulate files under the `__magic` tree. That's not bad, just potentially
confusing.

1. As written data is not materialized until the commit, it will not be possible
@@ -1693,14 +1718,6 @@ base for relative paths created underneath it.

The committers can only be tested against an S3-compatible object store.

The committers have some unit tests, and integration tests based on
the protocol integration test lifted from `org.apache.hadoop.mapreduce.lib.output.TestFileOutputCommitter`
to test various state transitions of the commit mechanism has been extended
@@ -1766,7 +1783,8 @@ tree.

Alternatively, the fact that Spark tasks provide data to the job committer on their
completion means that a list of pending PUT commands could be built up, with the commit
operations being executed by an S3A-specific implementation of the `FileCommitProtocol`.

[Obsolete] As noted earlier, this may permit the requirement for a consistent list operation
to be bypassed. It would still be important to list what was being written, as
it is needed to aid aborting work in failed tasks, but the list of files
created by successful tasks could be passed directly from the task to the committer,
@@ -1890,9 +1908,6 @@ bandwidth and the data upload bandwidth.

No use is made of the cluster filesystem; there are no risks there.

A malicious user with write access to the `__magic` directory could manipulate
or delete the metadata of pending uploads, or potentially inject new work into
the commit. Having access to the `__magic` directory implies write access
@@ -1900,13 +1915,12 @@ to the parent destination directory: a malicious user could just as easily

manipulate the final output, without needing to attack the committer's intermediate
files.


### Security Risks of all committers

#### Visibility

[Obsolete] If S3Guard is used for storing metadata, then the metadata is visible to
all users with read access. A malicious user with write access could delete
entries of newly generated files, so they would not be visible.
@@ -1941,7 +1955,7 @@ any of the text fields, script which could then be executed in some XSS

attack. We may wish to consider sanitizing this data on load.

* Paths in tampered data could be modified in an attempt to commit an upload across
an existing file, or the MPU ID altered to prematurely commit a different upload.
These attempts will not succeed, because the destination
path of the upload is declared on the initial POST to initiate the MPU, and
operations associated with the MPU must also declare the path: if the path and
@@ -26,10 +26,30 @@ and reliable commitment of output to S3.

For details on their internal design, see
[S3A Committers: Architecture and Implementation](./committer_architecture.html).

### January 2021 Update

Now that S3 is fully consistent, problems related to inconsistent directory
listings have gone. However, the rename problem remains: committing work by
renaming directories is unsafe as well as horribly slow.

This document, and the committers, were written at a time when S3
was inconsistent. The two committers addressed this problem differently:

* Staging Committer: relied on a cluster HDFS filesystem for safely propagating
the lists of files to commit from workers to the job manager/driver.
* Magic Committer: required S3Guard to offer consistent directory listings on the
object store.

With consistent S3, the Magic Committer can be safely used with any S3 bucket.
The choice of which to use, then, is a matter for experimentation.

This document was written in 2017, a time when S3 was only
consistent when an extra consistency layer such as S3Guard was used. The
document indicates where requirements/constraints which existed then are now
obsolete.

## Introduction: The Commit Problem

Apache Hadoop MapReduce (and behind the scenes, Apache Spark) often write
the output of their work to filesystems
@@ -50,21 +70,18 @@ or it is at the destination, in which case the rename actually succeeded.

**The S3 object store and the `s3a://` filesystem client cannot meet these requirements.**

Although S3 is (now) consistent, the S3A client still mimics `rename()`
by copying files and then deleting the originals.
This can fail partway through, and there is nothing to prevent any other process
in the cluster attempting a rename at the same time.

As a result,

* If a rename fails, the data is left in an unknown state.
* If more than one process attempts to commit work simultaneously, the output
directory may contain the results of both processes: it is no longer an exclusive
operation.
* Commit time is still
proportional to the amount of data created. It still can't handle task failure.

**Using the "classic" `FileOutputCommitter` to commit work to Amazon S3 risks
@@ -163,10 +180,8 @@ and restarting the job.

whose output is in the job attempt directory, *and only rerunning all uncommitted tasks*.


This algorithm does not work safely or swiftly with AWS S3 storage because
renames go from being fast, atomic operations to slow operations which can fail partway through.

This then is the problem which the S3A committers address:
@@ -341,9 +356,7 @@ task commit.

However, it has extra requirements of the filesystem:

1. [Obsolete] It requires a consistent object store.
1. The S3A client must be configured to recognize interactions
with the magic directories and treat them specially.
@@ -358,14 +371,15 @@ it the least mature of the committers.

Partitioned Committer. Make sure you have enough hard disk capacity for all staged data.
Do not use it in other situations.

1. If you do not have a shared cluster store: use the Magic Committer.
1. If you are writing large amounts of data: use the Magic Committer.
1. Otherwise: use the directory committer, making sure you have enough
hard disk capacity for all staged data.

Now that S3 is consistent, there are fewer reasons not to use the Magic Committer.
Experiment with both to see which works best for your workloads.
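If you settle on a staging committer, its intermediate paths can be tuned; a sketch using the standard options with their default values (illustrative):

```xml
<!-- Sketch: cluster-filesystem path used to propagate commit data (relative to
     the user's home directory), and the local buffer directory, which must have
     enough capacity for all staged data. -->
<property>
  <name>fs.s3a.committer.staging.tmp.path</name>
  <value>tmp/staging</value>
</property>
<property>
  <name>fs.s3a.buffer.dir</name>
  <value>${hadoop.tmp.dir}/s3a</value>
</property>
```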

## Switching to an S3A Committer

@@ -499,9 +513,6 @@ performance.

### FileSystem client setup

1. Turn the magic on by setting `fs.s3a.committer.magic.enabled`:

```xml
@@ -514,8 +525,6 @@ documentation to see if it is consistent, hence compatible "out of the box".

</property>
```

### Enabling the committer

@@ -569,11 +578,9 @@ Conflict management is left to the execution engine itself.

<property>
  <name>fs.s3a.committer.magic.enabled</name>
  <value>true</value>
  <description>
    Enable support in the filesystem for the S3 "Magic" committer.
  </description>
</property>
@@ -726,7 +733,6 @@ in configuration option fs.s3a.committer.magic.enabled

The Job is configured to use the magic committer, but the S3A bucket has not been explicitly
declared as supporting it.

This can be done for those buckets which are known to be consistent, either
because [S3Guard](s3guard.html) is used to provide consistency,
@@ -739,10 +745,6 @@ or because the S3-compatible filesystem is known to be strongly consistent.

</property>
```

Tip: you can verify that a bucket supports the magic committer through the
`hadoop s3guard bucket-info` command:
@@ -81,11 +81,12 @@ schemes.

* Supports authentication via: environment variables, Hadoop configuration
properties, the Hadoop key management store and IAM roles.
* Supports per-bucket configuration (see the sketch below).
* Supports S3 "Server Side Encryption" for both reading and writing:
SSE-S3, SSE-KMS and SSE-C
* Instrumented with Hadoop metrics.
* Before S3 was consistent, provided a consistent view of inconsistent storage
through [S3Guard](./s3guard.html).
* Actively maintained by the open source community.
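The per-bucket sketch referenced above: any `fs.s3a.` option can be overridden for a single bucket by inserting `bucket.NAME` into the key. The bucket name `example-bucket` is a placeholder:

```xml
<!-- Sketch: enable the magic committer for one bucket only. -->
<property>
  <name>fs.s3a.bucket.example-bucket.committer.magic.enabled</name>
  <value>true</value>
</property>
```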
@@ -30,11 +30,11 @@ That's because it's a very different system, as you can see:

| communication | RPC | HTTP GET/PUT/HEAD/LIST/COPY requests |
| data locality | local storage | remote S3 servers |
| replication | multiple datanodes | asynchronous after upload |
| consistency | consistent data and listings | consistent since November 2020 |
| bandwidth | best: local IO, worst: datacenter network | bandwidth between servers and S3 |
| latency | low | high, especially for "low cost" directory operations |
| rename | fast, atomic | slow faked rename through COPY and DELETE |
| delete | fast, atomic | fast for a file, slow and non-atomic for directories |
| writing | incremental | in blocks; not visible until the writer is closed |
| reading | seek() is fast | seek() is slow and expensive |
| IOPs | limited only by hardware | callers are throttled to shards in an S3 bucket |
@@ -615,24 +615,10 @@ characters can be configured in the Hadoop configuration.

**Consistency**

Since November 2020, AWS S3 has been fully consistent.
This also applies to S3 Select.
We do not know what happens if an object is overwritten while a query is active.

**Concurrency**

@@ -22,24 +22,39 @@

which can use a (consistent) database as the store of metadata about objects
in an S3 bucket.

It was written between 2016 and 2020, *when Amazon S3 was eventually consistent.*
It compensated for the following S3 inconsistencies:

* Newly created objects excluded from directory listings.
* Newly deleted objects retained in directory listings.
* Deleted objects still visible in existence probes and when opened for reading.
* S3 load balancer 404 caching when a probe is made for an object before its creation.

It did not compensate for update inconsistency, though by storing the etag
values of objects in the database, it could detect and report problems.

Now that S3 is consistent, there is no need for S3Guard at all.

S3Guard

1. Permitted a consistent view of the object store.

1. Could improve performance on directory listing/scanning operations,
including those which take place during the partitioning period of query
execution, the process where files are listed and the work divided up amongst
processes.

The basic idea was that, for each operation in the Hadoop S3 client (s3a) that
reads or modifies metadata, a shadow copy of that metadata is stored in a
separate MetadataStore implementation. The store was:

1. Updated after mutating operations on the store.
1. Updated after list operations against S3 discovered changes.
1. Looked up whenever a probe was made for a file/directory existing.
1. Queried for all objects under a path when a directory listing was made; the results were
merged with the S3 listing in non-authoritative mode, or used exclusively in
authoritative mode.

For links to early design documents and related patches, see
[HADOOP-13345](https://issues.apache.org/jira/browse/HADOOP-13345).
@@ -55,6 +70,19 @@ It is essential for all clients writing to an S3Guard-enabled

S3 Repository to use the feature. Clients reading the data may work directly
with the S3A data, in which case the normal S3 consistency guarantees apply.

## Moving off S3Guard

How to move off S3Guard, given it is no longer needed:

1. Unset the option `fs.s3a.metadatastore.impl` globally/for all buckets for which it
was selected (see the sketch below).
1. If the option `fs.s3a.s3guard.disabled.warn.level` has been changed from
the default (`SILENT`), change it back. You no longer need to be warned that S3Guard is disabled.
1. Restart all applications.

Once you are confident that all applications have been restarted, _delete the DynamoDB table_.
This is to avoid paying for a database you no longer need.
This is best done from the AWS GUI.
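A sketch of the configuration for steps 1 and 2; declaring the `NullMetadataStore` explicitly is equivalent to unsetting the option:

```xml
<!-- Sketch: revert to the no-op metadata store and restore the default
     (silent) warning level. -->
<property>
  <name>fs.s3a.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.NullMetadataStore</value>
</property>
<property>
  <name>fs.s3a.s3guard.disabled.warn.level</name>
  <value>SILENT</value>
</property>
```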

## Setting up S3Guard

@@ -70,7 +98,7 @@ without S3Guard. The following values are available:

* `WARN`: Warn that data may be at risk in workflows.
* `FAIL`: S3AFileSystem instantiation will fail.

The default setting is `SILENT`. The setting is case insensitive.
The required level can be set in the `core-site.xml`.

---
@@ -974,16 +974,18 @@ using an absolute XInclude reference to it.

**Warning: do not enable any type of failure injection in production. The
following settings are for testing only.**

One of the challenges with S3A integration tests is the fact that S3 was an
eventually-consistent storage system. To simulate inconsistencies more
frequently than they would normally surface, S3A supports a shim layer on top of the `AmazonS3Client`
class which artificially delays certain paths from appearing in listings.
This is implemented in the class `InconsistentAmazonS3Client`.

Now that S3 is consistent, injecting failures during integration and
functional testing is less important.
There's no need to enable it to verify that S3Guard can recover
from inconsistencies, given that in production such inconsistencies
will never surface.
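Where the fault-injecting client is still wanted, it is switched in through the S3 client factory option; a sketch (test configurations only, never production):

```xml
<!-- Sketch: bind S3A to the factory which creates InconsistentAmazonS3Client. -->
<property>
  <name>fs.s3a.s3.client.factory.impl</name>
  <value>org.apache.hadoop.fs.s3a.InconsistentS3ClientFactory</value>
</property>
```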

## Simulating List Inconsistencies

### Enabling the InconsistentAmazonS3Client

@@ -1062,9 +1064,6 @@ The default is 5000 milliseconds (five seconds).

</property>
```

### Limitations of Inconsistency Injection

@@ -1104,8 +1103,12 @@ inconsistent directory listings.

## <a name="s3guard"></a> Testing S3Guard

[S3Guard](./s3guard.html) is an extension to S3A which added consistent metadata
listings to the S3A client.

It has not been needed for applications to work safely with AWS S3 since November
2020. However, it is currently still part of the codebase, and so still
needs to be tested.

The basic strategy for testing S3Guard correctness consists of:
@@ -1018,61 +1018,6 @@ Something has been trying to write data to "/".

These are the issues where S3 does not appear to behave the way a filesystem
"should".

### File not visible/saved

@@ -1159,6 +1104,11 @@ for more information.

A file being renamed and listed in the S3Guard table could not be found
in the S3 bucket even after multiple attempts.

Now that S3 is consistent, this is a sign that the S3Guard table is out of sync with
the S3 data.

Fix: disable S3Guard: it is no longer needed.

```
org.apache.hadoop.fs.s3a.RemoteFileChangedException: copyFile(/sourcedir/missing, /destdir/)
`s3a://example/sourcedir/missing': File not found on S3 after repeated attempts: `s3a://example/sourcedir/missing'
@@ -1169,10 +1119,6 @@ at org.apache.hadoop.fs.s3a.impl.RenameOperation.copySourceAndUpdateTracker(RenameOperation.java:448)

at org.apache.hadoop.fs.s3a.impl.RenameOperation.lambda$initiateCopy$0(RenameOperation.java:412)
```

If this error occurs and the file is on S3, consider increasing the value of
`fs.s3a.s3guard.consistency.retry.limit`, as in the sketch below.
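A sketch of raising that retry limit while S3Guard remains deployed; the value shown is illustrative, not a recommendation:

```xml
<!-- Sketch: extra attempts to find a just-renamed file; illustrative value. -->
<property>
  <name>fs.s3a.s3guard.consistency.retry.limit</name>
  <value>11</value>
</property>
```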
@@ -1180,29 +1126,6 @@ We also recommend using applications/application

options which do not rename files when committing work or when copying data
to S3, but instead write directly to the final destination.

## <a name="encryption"></a> S3 Server Side Encryption