HADOOP-17480. Document that AWS S3 is consistent and that S3Guard is not needed (#2636)

Contributed by Steve Loughran.
This commit is contained in:
Steve Loughran 2021-01-25 13:21:34 +00:00 committed by GitHub
parent 2fa73a2387
commit 06a5d3437f
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
8 changed files with 144 additions and 187 deletions

View File

@ -20,6 +20,27 @@ This document covers the architecture and implementation details of the S3A comm
For information on using the committers, see [the S3A Committers](./committer.html). For information on using the committers, see [the S3A Committers](./committer.html).
### January 2021 Update
Now that S3 is fully consistent, problems related to inconsistent
directory listings have gone. However the rename problem exists: committing
work by renaming directories is unsafe as well as horribly slow.
This architecture document, and the committers, were written at a time
when S3 was inconsistent. The two committers addressed this problem differently
* Staging Committer: rely on a cluster HDFS filesystem for safely propagating
the lists of files to commit from workers to the job manager/driver.
* Magic Committer: require S3Guard to offer consistent directory listings
on the object store.
With consistent S3, the Magic Committer can be safely used with any S3 bucket.
The choice of which to use, then, is matter for experimentation.
This architecture document was written in 2017, a time when S3 was only
consistent when an extra consistency layer such as S3Guard was used.
The document indicates where requirements/constraints which existed then
are now obsolete.
## Problem: Efficient, reliable commits of work to consistent S3 buckets ## Problem: Efficient, reliable commits of work to consistent S3 buckets
@ -49,10 +70,10 @@ can be executed server-side, but as it does not complete until the in-cluster
copy has completed, it takes time proportional to the amount of data. copy has completed, it takes time proportional to the amount of data.
The rename overhead is the most visible issue, but it is not the most dangerous. The rename overhead is the most visible issue, but it is not the most dangerous.
That is the fact that path listings have no consistency guarantees, and may That is the fact that until late 2020, path listings had no consistency guarantees,
lag the addition or deletion of files. and may have lagged the addition or deletion of files.
If files are not listed, the commit operation will *not* copy them, and If files were not listed, the commit operation would *not* copy them, and
so they will not appear in the final output. so they would not appear in the final output.
The solution to this problem is closely coupled to the S3 protocol itself: The solution to this problem is closely coupled to the S3 protocol itself:
delayed completion of multi-part PUT operations delayed completion of multi-part PUT operations
@ -828,6 +849,8 @@ commit sequence in `Task.done()`, when `talkToAMTGetPermissionToCommit()`
# Requirements of an S3A Committer # Requirements of an S3A Committer
The design requirements of the S3A committer were
1. Support an eventually consistent S3 object store as a reliable direct 1. Support an eventually consistent S3 object store as a reliable direct
destination of work through the S3A filesystem client. destination of work through the S3A filesystem client.
1. Efficient: implies no rename, and a minimal amount of delay in the job driver's 1. Efficient: implies no rename, and a minimal amount of delay in the job driver's
@ -841,6 +864,7 @@ the job, and any previous incompleted jobs.
1. Security: not to permit privilege escalation from other users with 1. Security: not to permit privilege escalation from other users with
write access to the same file system(s). write access to the same file system(s).
## Features of S3 and the S3A Client ## Features of S3 and the S3A Client
@ -852,8 +876,8 @@ MR committer algorithms have significant performance problems.
1. Single-object renames are implemented as a copy and delete sequence. 1. Single-object renames are implemented as a copy and delete sequence.
1. COPY is atomic, but overwrites cannot be prevented. 1. COPY is atomic, but overwrites cannot be prevented.
1. Amazon S3 is eventually consistent on listings, deletes and updates. 1. [Obsolete] Amazon S3 is eventually consistent on listings, deletes and updates.
1. Amazon S3 has create consistency, however, the negative response of a HEAD/GET 1. [Obsolete] Amazon S3 has create consistency, however, the negative response of a HEAD/GET
performed on a path before an object was created can be cached, unintentionally performed on a path before an object was created can be cached, unintentionally
creating a create inconsistency. The S3A client library does perform such a check, creating a create inconsistency. The S3A client library does perform such a check,
on `create()` and `rename()` to check the state of the destination path, and on `create()` and `rename()` to check the state of the destination path, and
@ -872,11 +896,12 @@ data, with the `S3ABlockOutputStream` of HADOOP-13560 uploading written data
as parts of a multipart PUT once the threshold set in the configuration as parts of a multipart PUT once the threshold set in the configuration
parameter `fs.s3a.multipart.size` (default: 100MB). parameter `fs.s3a.multipart.size` (default: 100MB).
[S3Guard](./s3guard.html) adds an option of consistent view of the filesystem [S3Guard](./s3guard.html) added an option of consistent view of the filesystem
to all processes using the shared DynamoDB table as the authoritative store of to all processes using the shared DynamoDB table as the authoritative store of
metadata. Some S3-compatible object stores are fully consistent; the metadata.
proposed algorithm is designed to work with such object stores without the The proposed algorithm was designed to work with such object stores without the
need for any DynamoDB tables. need for any DynamoDB tables. Since AWS S3 became consistent in 2020, this
means that they will work directly with the store.
## Related work: Spark's `DirectOutputCommitter` ## Related work: Spark's `DirectOutputCommitter`
@ -1246,8 +1271,8 @@ for parallel committing of work, including all the error handling based on
the Netflix experience. the Netflix experience.
It differs in that it directly streams data to S3 (there is no staging), It differs in that it directly streams data to S3 (there is no staging),
and it also stores the lists of pending commits in S3 too. That mandates and it also stores the lists of pending commits in S3 too. It
consistent metadata on S3, which S3Guard provides. requires a consistent S3 store.
### Core concept: A new/modified output stream for delayed PUT commits ### Core concept: A new/modified output stream for delayed PUT commits
@ -1480,7 +1505,7 @@ The time to commit a job will be `O(files/threads)`
Every `.pendingset` file in the job attempt directory must be loaded, and a PUT Every `.pendingset` file in the job attempt directory must be loaded, and a PUT
request issued for every incomplete upload listed in the files. request issued for every incomplete upload listed in the files.
Note that it is the bulk listing of all children which is where full consistency [Obsolete] Note that it is the bulk listing of all children which is where full consistency
is required. If instead, the list of files to commit could be returned from is required. If instead, the list of files to commit could be returned from
tasks to the job committer, as the Spark commit protocol allows, it would be tasks to the job committer, as the Spark commit protocol allows, it would be
possible to commit data to an inconsistent object store. possible to commit data to an inconsistent object store.
@ -1525,7 +1550,7 @@ commit algorithms.
1. It is possible to create more than one client writing to the 1. It is possible to create more than one client writing to the
same destination file within the same S3A client/task, either sequentially or in parallel. same destination file within the same S3A client/task, either sequentially or in parallel.
1. Even with a consistent metadata store, if a job overwrites existing 1. [Obsolete] Even with a consistent metadata store, if a job overwrites existing
files, then old data may still be visible to clients reading the data, until files, then old data may still be visible to clients reading the data, until
the update has propagated to all replicas of the data. the update has propagated to all replicas of the data.
@ -1538,7 +1563,7 @@ all files in the destination directory which where not being overwritten.
for any purpose other than for the storage of pending commit data. for any purpose other than for the storage of pending commit data.
1. Unless extra code is added to every FS operation, it will still be possible 1. Unless extra code is added to every FS operation, it will still be possible
to manipulate files under the `__magic` tree. That's not bad, it just potentially to manipulate files under the `__magic` tree. That's not bad, just potentially
confusing. confusing.
1. As written data is not materialized until the commit, it will not be possible 1. As written data is not materialized until the commit, it will not be possible
@ -1693,14 +1718,6 @@ base for relative paths created underneath it.
The committers can only be tested against an S3-compatible object store. The committers can only be tested against an S3-compatible object store.
Although a consistent object store is a requirement for a production deployment
of the magic committer an inconsistent one has appeared to work during testing, simply by
adding some delays to the operations: a task commit does not succeed until
all the objects which it has PUT are visible in the LIST operation. Assuming
that further listings from the same process also show the objects, the job
committer will be able to list and commit the uploads.
The committers have some unit tests, and integration tests based on The committers have some unit tests, and integration tests based on
the protocol integration test lifted from `org.apache.hadoop.mapreduce.lib.output.TestFileOutputCommitter` the protocol integration test lifted from `org.apache.hadoop.mapreduce.lib.output.TestFileOutputCommitter`
to test various state transitions of the commit mechanism has been extended to test various state transitions of the commit mechanism has been extended
@ -1766,7 +1783,8 @@ tree.
Alternatively, the fact that Spark tasks provide data to the job committer on their Alternatively, the fact that Spark tasks provide data to the job committer on their
completion means that a list of pending PUT commands could be built up, with the commit completion means that a list of pending PUT commands could be built up, with the commit
operations being executed by an S3A-specific implementation of the `FileCommitProtocol`. operations being executed by an S3A-specific implementation of the `FileCommitProtocol`.
As noted earlier, this may permit the requirement for a consistent list operation
[Obsolete] As noted earlier, this may permit the requirement for a consistent list operation
to be bypassed. It would still be important to list what was being written, as to be bypassed. It would still be important to list what was being written, as
it is needed to aid aborting work in failed tasks, but the list of files it is needed to aid aborting work in failed tasks, but the list of files
created by successful tasks could be passed directly from the task to committer, created by successful tasks could be passed directly from the task to committer,
@ -1890,9 +1908,6 @@ bandwidth and the data upload bandwidth.
No use is made of the cluster filesystem; there are no risks there. No use is made of the cluster filesystem; there are no risks there.
A consistent store is required, which, for Amazon's infrastructure, means S3Guard.
This is covered below.
A malicious user with write access to the `__magic` directory could manipulate A malicious user with write access to the `__magic` directory could manipulate
or delete the metadata of pending uploads, or potentially inject new work int or delete the metadata of pending uploads, or potentially inject new work int
the commit. Having access to the `__magic` directory implies write access the commit. Having access to the `__magic` directory implies write access
@ -1900,13 +1915,12 @@ to the parent destination directory: a malicious user could just as easily
manipulate the final output, without needing to attack the committer's intermediate manipulate the final output, without needing to attack the committer's intermediate
files. files.
### Security Risks of all committers ### Security Risks of all committers
#### Visibility #### Visibility
* If S3Guard is used for storing metadata, then the metadata is visible to [Obsolete] If S3Guard is used for storing metadata, then the metadata is visible to
all users with read access. A malicious user with write access could delete all users with read access. A malicious user with write access could delete
entries of newly generated files, so they would not be visible. entries of newly generated files, so they would not be visible.
@ -1941,7 +1955,7 @@ any of the text fields, script which could then be executed in some XSS
attack. We may wish to consider sanitizing this data on load. attack. We may wish to consider sanitizing this data on load.
* Paths in tampered data could be modified in an attempt to commit an upload across * Paths in tampered data could be modified in an attempt to commit an upload across
an existing file, or the MPU ID alterated to prematurely commit a different upload. an existing file, or the MPU ID altered to prematurely commit a different upload.
These attempts will not going to succeed, because the destination These attempts will not going to succeed, because the destination
path of the upload is declared on the initial POST to initiate the MPU, and path of the upload is declared on the initial POST to initiate the MPU, and
operations associated with the MPU must also declare the path: if the path and operations associated with the MPU must also declare the path: if the path and

View File

@ -26,10 +26,30 @@ and reliable commitment of output to S3.
For details on their internal design, see For details on their internal design, see
[S3A Committers: Architecture and Implementation](./committer_architecture.html). [S3A Committers: Architecture and Implementation](./committer_architecture.html).
### January 2021 Update
Now that S3 is fully consistent, problems related to inconsistent directory
listings have gone. However the rename problem exists: committing work by
renaming directories is unsafe as well as horribly slow.
This architecture document, and the committers, were written at a time when S3
was inconsistent. The two committers addressed this problem differently
* Staging Committer: rely on a cluster HDFS filesystem for safely propagating
the lists of files to commit from workers to the job manager/driver.
* Magic Committer: require S3Guard to offer consistent directory listings on the
object store.
With consistent S3, the Magic Committer can be safely used with any S3 bucket.
The choice of which to use, then, is matter for experimentation.
This document was written in 2017, a time when S3 was only
consistent when an extra consistency layer such as S3Guard was used. The
document indicates where requirements/constraints which existed then are now
obsolete.
## Introduction: The Commit Problem ## Introduction: The Commit Problem
Apache Hadoop MapReduce (and behind the scenes, Apache Spark) often write Apache Hadoop MapReduce (and behind the scenes, Apache Spark) often write
the output of their work to filesystems the output of their work to filesystems
@ -50,21 +70,18 @@ or it is at the destination, -in which case the rename actually succeeded.
**The S3 object store and the `s3a://` filesystem client cannot meet these requirements.* **The S3 object store and the `s3a://` filesystem client cannot meet these requirements.*
1. Amazon S3 has inconsistent directory listings unless S3Guard is enabled. Although S3A is (now) consistent, the S3A client still mimics `rename()`
1. The S3A mimics `rename()` by copying files and then deleting the originals. by copying files and then deleting the originals.
This can fail partway through, and there is nothing to prevent any other process This can fail partway through, and there is nothing to prevent any other process
in the cluster attempting a rename at the same time. in the cluster attempting a rename at the same time.
As a result, As a result,
* Files my not be listed, hence not renamed into place.
* Deleted files may still be discovered, confusing the rename process to the point
of failure.
* If a rename fails, the data is left in an unknown state. * If a rename fails, the data is left in an unknown state.
* If more than one process attempts to commit work simultaneously, the output * If more than one process attempts to commit work simultaneously, the output
directory may contain the results of both processes: it is no longer an exclusive directory may contain the results of both processes: it is no longer an exclusive
operation. operation.
*. While S3Guard may deliver the listing consistency, commit time is still *. Commit time is still
proportional to the amount of data created. It still can't handle task failure. proportional to the amount of data created. It still can't handle task failure.
**Using the "classic" `FileOutputCommmitter` to commit work to Amazon S3 risks **Using the "classic" `FileOutputCommmitter` to commit work to Amazon S3 risks
@ -163,10 +180,8 @@ and restarting the job.
whose output is in the job attempt directory, *and only rerunning all uncommitted tasks*. whose output is in the job attempt directory, *and only rerunning all uncommitted tasks*.
None of this algorithm works safely or swiftly when working with "raw" AWS S3 storage: This algorithm does not works safely or swiftly with AWS S3 storage because
* Directory listing can be inconsistent: the tasks and jobs may not list all work to tenames go from being fast, atomic operations to slow operations which can fail partway through.
be committed.
* Renames go from being fast, atomic operations to slow operations which can fail partway through.
This then is the problem which the S3A committers address: This then is the problem which the S3A committers address:
@ -341,9 +356,7 @@ task commit.
However, it has extra requirements of the filesystem However, it has extra requirements of the filesystem
1. It requires a consistent object store, which for Amazon S3, 1. [Obsolete] It requires a consistent object store.
means that [S3Guard](./s3guard.html) must be enabled. For third-party stores,
consult the documentation.
1. The S3A client must be configured to recognize interactions 1. The S3A client must be configured to recognize interactions
with the magic directories and treat them specially. with the magic directories and treat them specially.
@ -358,14 +371,15 @@ it the least mature of the committers.
Partitioned Committer. Make sure you have enough hard disk capacity for all staged data. Partitioned Committer. Make sure you have enough hard disk capacity for all staged data.
Do not use it in other situations. Do not use it in other situations.
1. If you know that your object store is consistent, or that the processes 1. If you do not have a shared cluster store: use the Magic Committer.
writing data use S3Guard, use the Magic Committer for higher performance
writing of large amounts of data. 1. If you are writing large amounts of data: use the Magic Committer.
1. Otherwise: use the directory committer, making sure you have enough 1. Otherwise: use the directory committer, making sure you have enough
hard disk capacity for all staged data. hard disk capacity for all staged data.
Put differently: start with the Directory Committer. Now that S3 is consistent, there are fewer reasons not to use the Magic Committer.
Experiment with both to see which works best for your work.
## Switching to an S3A Committer ## Switching to an S3A Committer
@ -499,9 +513,6 @@ performance.
### FileSystem client setup ### FileSystem client setup
1. Use a *consistent* S3 object store. For Amazon S3, this means enabling
[S3Guard](./s3guard.html). For S3-compatible filesystems, consult the filesystem
documentation to see if it is consistent, hence compatible "out of the box".
1. Turn the magic on by `fs.s3a.committer.magic.enabled"` 1. Turn the magic on by `fs.s3a.committer.magic.enabled"`
```xml ```xml
@ -514,8 +525,6 @@ documentation to see if it is consistent, hence compatible "out of the box".
</property> </property>
``` ```
*Do not use the Magic Committer on an inconsistent S3 object store. For
Amazon S3, that means S3Guard must *always* be enabled.
### Enabling the committer ### Enabling the committer
@ -569,11 +578,9 @@ Conflict management is left to the execution engine itself.
<property> <property>
<name>fs.s3a.committer.magic.enabled</name> <name>fs.s3a.committer.magic.enabled</name>
<value>false</value> <value>true</value>
<description> <description>
Enable support in the filesystem for the S3 "Magic" committer. Enable support in the filesystem for the S3 "Magic" committer.
When working with AWS S3, S3Guard must be enabled for the destination
bucket, as consistent metadata listings are required.
</description> </description>
</property> </property>
@ -726,7 +733,6 @@ in configuration option fs.s3a.committer.magic.enabled
The Job is configured to use the magic committer, but the S3A bucket has not been explicitly The Job is configured to use the magic committer, but the S3A bucket has not been explicitly
declared as supporting it. declared as supporting it.
The destination bucket **must** be declared as supporting the magic committer.
This can be done for those buckets which are known to be consistent, either This can be done for those buckets which are known to be consistent, either
because [S3Guard](s3guard.html) is used to provide consistency, because [S3Guard](s3guard.html) is used to provide consistency,
@ -739,10 +745,6 @@ or because the S3-compatible filesystem is known to be strongly consistent.
</property> </property>
``` ```
*IMPORTANT*: only enable the magic committer against object stores which
offer consistent listings. By default, Amazon S3 does not do this -which is
why the option `fs.s3a.committer.magic.enabled` is disabled by default.
Tip: you can verify that a bucket supports the magic committer through the Tip: you can verify that a bucket supports the magic committer through the
`hadoop s3guard bucket-info` command: `hadoop s3guard bucket-info` command:

View File

@ -81,11 +81,12 @@ schemes.
* Supports authentication via: environment variables, Hadoop configuration * Supports authentication via: environment variables, Hadoop configuration
properties, the Hadoop key management store and IAM roles. properties, the Hadoop key management store and IAM roles.
* Supports per-bucket configuration. * Supports per-bucket configuration.
* With [S3Guard](./s3guard.html), adds high performance and consistent metadata/
directory read operations. This delivers consistency as well as speed.
* Supports S3 "Server Side Encryption" for both reading and writing: * Supports S3 "Server Side Encryption" for both reading and writing:
SSE-S3, SSE-KMS and SSE-C SSE-S3, SSE-KMS and SSE-C
* Instrumented with Hadoop metrics. * Instrumented with Hadoop metrics.
* Before S3 was consistent, provided a consistent view of inconsistent storage
through [S3Guard](./s3guard.html).
* Actively maintained by the open source community. * Actively maintained by the open source community.

View File

@ -30,11 +30,11 @@ That's because its a very different system, as you can see:
| communication | RPC | HTTP GET/PUT/HEAD/LIST/COPY requests | | communication | RPC | HTTP GET/PUT/HEAD/LIST/COPY requests |
| data locality | local storage | remote S3 servers | | data locality | local storage | remote S3 servers |
| replication | multiple datanodes | asynchronous after upload | | replication | multiple datanodes | asynchronous after upload |
| consistency | consistent data and listings | eventual consistent for listings, deletes and updates | | consistency | consistent data and listings | consistent since November 2020|
| bandwidth | best: local IO, worst: datacenter network | bandwidth between servers and S3 | | bandwidth | best: local IO, worst: datacenter network | bandwidth between servers and S3 |
| latency | low | high, especially for "low cost" directory operations | | latency | low | high, especially for "low cost" directory operations |
| rename | fast, atomic | slow faked rename through COPY & DELETE| | rename | fast, atomic | slow faked rename through COPY and DELETE|
| delete | fast, atomic | fast for a file, slow & non-atomic for directories | | delete | fast, atomic | fast for a file, slow and non-atomic for directories |
| writing| incremental | in blocks; not visible until the writer is closed | | writing| incremental | in blocks; not visible until the writer is closed |
| reading | seek() is fast | seek() is slow and expensive | | reading | seek() is fast | seek() is slow and expensive |
| IOPs | limited only by hardware | callers are throttled to shards in an s3 bucket | | IOPs | limited only by hardware | callers are throttled to shards in an s3 bucket |

View File

@ -615,24 +615,10 @@ characters can be configured in the Hadoop configuration.
**Consistency** **Consistency**
* Assume the usual S3 consistency model applies. Since November 2020, AWS S3 has been fully consistent.
This also applies to S3 Select.
We do not know what happens if an object is overwritten while a query is active.
* When enabled, S3Guard's DynamoDB table will declare whether or not
a newly deleted file is visible: if it is marked as deleted, the
select request will be rejected with a `FileNotFoundException`.
* When an existing S3-hosted object is changed, the S3 select operation
may return the results of a SELECT call as applied to either the old
or new version.
* We don't know whether you can get partially consistent reads, or whether
an extended read ever picks up a later value.
* The AWS S3 load balancers can briefly cache 404/Not-Found entries
from a failed HEAD/GET request against a nonexistent file; this cached
entry can briefly create create inconsistency, despite the
AWS "Create is consistent" model. There is no attempt to detect or recover from
this.
**Concurrency** **Concurrency**

View File

@ -22,24 +22,39 @@
which can use a (consistent) database as the store of metadata about objects which can use a (consistent) database as the store of metadata about objects
in an S3 bucket. in an S3 bucket.
It was written been 2016 and 2020, *when Amazon S3 was eventually consistent.*
It compensated for the following S3 inconsistencies:
* Newly created objects excluded from directory listings.
* Newly deleted objects retained in directory listings.
* Deleted objects still visible in existence probes and opening for reading.
* S3 Load balancer 404 caching when a probe is made for an object before its creation.
It did not compensate for update inconsistency, though by storing the etag
values of objects in the database, it could detect and report problems.
Now that S3 is consistent, there is no need for S3Guard at all.
S3Guard S3Guard
1. May improve performance on directory listing/scanning operations, 1. Permitted a consistent view of the object store.
1. Could improve performance on directory listing/scanning operations.
including those which take place during the partitioning period of query including those which take place during the partitioning period of query
execution, the process where files are listed and the work divided up amongst execution, the process where files are listed and the work divided up amongst
processes. processes.
1. Permits a consistent view of the object store. Without this, changes in
objects may not be immediately visible, especially in listing operations.
1. Offers a platform for future performance improvements for running Hadoop
workloads on top of object stores
The basic idea is that, for each operation in the Hadoop S3 client (s3a) that The basic idea was that, for each operation in the Hadoop S3 client (s3a) that
reads or modifies metadata, a shadow copy of that metadata is stored in a reads or modifies metadata, a shadow copy of that metadata is stored in a
separate MetadataStore implementation. Each MetadataStore implementation separate MetadataStore implementation. The store was
offers HDFS-like consistency for the metadata, and may also provide faster 1. Updated after mutating operations on the store
lookups for things like file status or directory listings. 1. Updated after list operations against S3 discovered changes
1. Looked up whenever a probe was made for a file/directory existing.
1. Queried for all objects under a path when a directory listing was made; the results were
merged with the S3 listing in a non-authoritative path, used exclusively in
authoritative mode.
For links to early design documents and related patches, see For links to early design documents and related patches, see
[HADOOP-13345](https://issues.apache.org/jira/browse/HADOOP-13345). [HADOOP-13345](https://issues.apache.org/jira/browse/HADOOP-13345).
@ -55,6 +70,19 @@ It is essential for all clients writing to an S3Guard-enabled
S3 Repository to use the feature. Clients reading the data may work directly S3 Repository to use the feature. Clients reading the data may work directly
with the S3A data, in which case the normal S3 consistency guarantees apply. with the S3A data, in which case the normal S3 consistency guarantees apply.
## Moving off S3Guard
How to move off S3Guard, given it is no longer needed.
1. Unset the option `fs.s3a.metadatastore.impl` globally/for all buckets for which it
was selected.
1. If the option `org.apache.hadoop.fs.s3a.s3guard.disabled.warn.level` has been changed from
the default (`SILENT`), change it back. You no longer need to be warned that S3Guard is disabled.
1. Restart all applications.
Once you are confident that all applications have been restarted, _Delete the DynamoDB table_.
This is to avoid paying for a database you no longer need.
This is best done from the AWS GUI.
## Setting up S3Guard ## Setting up S3Guard
@ -70,7 +98,7 @@ without S3Guard. The following values are available:
* `WARN`: Warn that data may be at risk in workflows. * `WARN`: Warn that data may be at risk in workflows.
* `FAIL`: S3AFileSystem instantiation will fail. * `FAIL`: S3AFileSystem instantiation will fail.
The default setting is INFORM. The setting is case insensitive. The default setting is `SILENT`. The setting is case insensitive.
The required level can be set in the `core-site.xml`. The required level can be set in the `core-site.xml`.
--- ---

View File

@ -974,16 +974,18 @@ using an absolute XInclude reference to it.
**Warning do not enable any type of failure injection in production. The **Warning do not enable any type of failure injection in production. The
following settings are for testing only.** following settings are for testing only.**
One of the challenges with S3A integration tests is the fact that S3 is an One of the challenges with S3A integration tests is the fact that S3 was an
eventually-consistent storage system. In practice, we rarely see delays in eventually-consistent storage system. To simulate inconsistencies more
visibility of recently created objects both in listings (`listStatus()`) and frequently than they would normally surface, S3A supports a shim layer on top of the `AmazonS3Client`
when getting a single file's metadata (`getFileStatus()`). Since this behavior
is rare and non-deterministic, thorough integration testing is challenging.
To address this, S3A supports a shim layer on top of the `AmazonS3Client`
class which artificially delays certain paths from appearing in listings. class which artificially delays certain paths from appearing in listings.
This is implemented in the class `InconsistentAmazonS3Client`. This is implemented in the class `InconsistentAmazonS3Client`.
Now that S3 is consistent, injecting failures during integration and
functional testing is less important.
There's no need to enable it to verify that S3Guard can recover
from consistencies, given that in production such consistencies
will never surface.
## Simulating List Inconsistencies ## Simulating List Inconsistencies
### Enabling the InconsistentAmazonS3CClient ### Enabling the InconsistentAmazonS3CClient
@ -1062,9 +1064,6 @@ The default is 5000 milliseconds (five seconds).
</property> </property>
``` ```
Future versions of this client will introduce new failure modes,
with simulation of S3 throttling exceptions the next feature under
development.
### Limitations of Inconsistency Injection ### Limitations of Inconsistency Injection
@ -1104,8 +1103,12 @@ inconsistent directory listings.
## <a name="s3guard"></a> Testing S3Guard ## <a name="s3guard"></a> Testing S3Guard
[S3Guard](./s3guard.html) is an extension to S3A which adds consistent metadata [S3Guard](./s3guard.html) is an extension to S3A which added consistent metadata
listings to the S3A client. As it is part of S3A, it also needs to be tested. listings to the S3A client.
It has not been needed for applications to work safely with AWS S3 since November
2020. However, it is currently still part of the codebase, and so something which
needs to be tested.
The basic strategy for testing S3Guard correctness consists of: The basic strategy for testing S3Guard correctness consists of:

View File

@ -1018,61 +1018,6 @@ Something has been trying to write data to "/".
These are the issues where S3 does not appear to behave the way a filesystem These are the issues where S3 does not appear to behave the way a filesystem
"should". "should".
### Visible S3 Inconsistency
Amazon S3 is *an eventually consistent object store*. That is: not a filesystem.
To reduce visible inconsistencies, use the [S3Guard](./s3guard.html) consistency
cache.
By default, Amazon S3 offers read-after-create consistency: a newly created file
is immediately visible.
There is a small quirk: a negative GET may be cached, such
that even if an object is immediately created, the fact that there "wasn't"
an object is still remembered.
That means the following sequence on its own will be consistent
```
touch(path) -> getFileStatus(path)
```
But this sequence *may* be inconsistent.
```
getFileStatus(path) -> touch(path) -> getFileStatus(path)
```
A common source of visible inconsistencies is that the S3 metadata
database —the part of S3 which serves list requests— is updated asynchronously.
Newly added or deleted files may not be visible in the index, even though direct
operations on the object (`HEAD` and `GET`) succeed.
That means the `getFileStatus()` and `open()` operations are more likely
to be consistent with the state of the object store, but without S3Guard enabled,
directory list operations such as `listStatus()`, `listFiles()`, `listLocatedStatus()`,
and `listStatusIterator()` may not see newly created files, and still list
old files.
### `FileNotFoundException` even though the file was just written.
This can be a sign of consistency problems. It may also surface if there is some
asynchronous file write operation still in progress in the client: the operation
has returned, but the write has not yet completed. While the S3A client code
does block during the `close()` operation, we suspect that asynchronous writes
may be taking place somewhere in the stack —this could explain why parallel tests
fail more often than serialized tests.
### File not found in a directory listing, even though `getFileStatus()` finds it
(Similarly: deleted file found in listing, though `getFileStatus()` reports
that it is not there)
This is a visible sign of updates to the metadata server lagging
behind the state of the underlying filesystem.
Fix: Use [S3Guard](s3guard.html).
### File not visible/saved ### File not visible/saved
@ -1159,6 +1104,11 @@ for more information.
A file being renamed and listed in the S3Guard table could not be found A file being renamed and listed in the S3Guard table could not be found
in the S3 bucket even after multiple attempts. in the S3 bucket even after multiple attempts.
Now that S3 is consistent, this is sign that the S3Guard table is out of sync with
the S3 Data.
Fix: disable S3Guard: it is no longer needed.
``` ```
org.apache.hadoop.fs.s3a.RemoteFileChangedException: copyFile(/sourcedir/missing, /destdir/) org.apache.hadoop.fs.s3a.RemoteFileChangedException: copyFile(/sourcedir/missing, /destdir/)
`s3a://example/sourcedir/missing': File not found on S3 after repeated attempts: `s3a://example/sourcedir/missing' `s3a://example/sourcedir/missing': File not found on S3 after repeated attempts: `s3a://example/sourcedir/missing'
@ -1169,10 +1119,6 @@ at org.apache.hadoop.fs.s3a.impl.RenameOperation.copySourceAndUpdateTracker(Rena
at org.apache.hadoop.fs.s3a.impl.RenameOperation.lambda$initiateCopy$0(RenameOperation.java:412) at org.apache.hadoop.fs.s3a.impl.RenameOperation.lambda$initiateCopy$0(RenameOperation.java:412)
``` ```
Either the file has been deleted, or an attempt was made to read a file before it
was created and the S3 load balancer has briefly cached the 404 returned by that
operation. This is something which AWS S3 can do for short periods.
If error occurs and the file is on S3, consider increasing the value of If error occurs and the file is on S3, consider increasing the value of
`fs.s3a.s3guard.consistency.retry.limit`. `fs.s3a.s3guard.consistency.retry.limit`.
@ -1180,29 +1126,6 @@ We also recommend using applications/application
options which do not rename files when committing work or when copying data options which do not rename files when committing work or when copying data
to S3, but instead write directly to the final destination. to S3, but instead write directly to the final destination.
### `RemoteFileChangedException`: "File to rename not found on unguarded S3 store"
```
org.apache.hadoop.fs.s3a.RemoteFileChangedException: copyFile(/sourcedir/missing, /destdir/)
`s3a://example/sourcedir/missing': File to rename not found on unguarded S3 store: `s3a://example/sourcedir/missing'
at org.apache.hadoop.fs.s3a.S3AFileSystem.copyFile(S3AFileSystem.java:3231)
at org.apache.hadoop.fs.s3a.S3AFileSystem.access$700(S3AFileSystem.java:177)
at org.apache.hadoop.fs.s3a.S3AFileSystem$RenameOperationCallbacksImpl.copyFile(S3AFileSystem.java:1368)
at org.apache.hadoop.fs.s3a.impl.RenameOperation.copySourceAndUpdateTracker(RenameOperation.java:448)
at org.apache.hadoop.fs.s3a.impl.RenameOperation.lambda$initiateCopy$0(RenameOperation.java:412)
```
An attempt was made to rename a file in an S3 store not protected by SGuard,
the directory list operation included the filename in its results but the
actual operation to rename the file failed.
This can happen because S3 directory listings and the store itself are not
consistent: the list operation tends to lag changes in the store.
It is possible that the file has been deleted.
The fix here is to use S3Guard. We also recommend using applications/application
options which do not rename files when committing work or when copying data
to S3, but instead write directly to the final destination.
## <a name="encryption"></a> S3 Server Side Encryption ## <a name="encryption"></a> S3 Server Side Encryption