HADOOP-13309. Document S3A known limitations in file ownership and permission model. Contributed by Chris Nauroth.

This commit is contained in:
Chris Nauroth 2016-10-25 09:03:03 -07:00
parent dbd205762e
commit 309a43925c
2 changed files with 44 additions and 5 deletions


@@ -373,6 +373,21 @@ a time proportional to the quantity of data to upload, and inversely proportional
to the network bandwidth. It may also fail —a failure that is better
escalated than ignored.
1. **Authorization**. Hadoop uses the `FileStatus` class to
represent core metadata of files and directories, including the owner, group and
permissions. Object stores might not have a viable way to persist this
metadata, so they might need to populate `FileStatus` with stub values. Even if
the object store persists this metadata, it still might not be feasible for the
object store to enforce file authorization in the same way as a traditional file
system. If the object store cannot persist this metadata, then the recommended
convention is:
* File owner is reported as the current user.
* File group also is reported as the current user.
* Directory permissions are reported as 777.
* File permissions are reported as 666.
* File system APIs that set ownership and permissions execute successfully
without error, but they are no-ops.
Object stores with these characteristics cannot be used as a direct replacement
for HDFS. In terms of this specification, their implementations of the
specified operations do not match those required. They are considered supported
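
The stub-metadata convention above can be modelled in a few lines. This is an illustrative sketch only, not part of the Hadoop `FileStatus` API; every class and method name here is invented for the example:

```java
// Minimal model of the stub-metadata convention described above.
// The class and method names are illustrative, not any Hadoop API.
public class StubMetadata {

    static String currentUser() {
        return System.getProperty("user.name");
    }

    // Owner and group are both reported as the current user.
    static String owner() { return currentUser(); }
    static String group() { return currentUser(); }

    // Directories report 777, files report 666.
    static String permissions(boolean isDirectory) {
        return isDirectory ? "rwxrwxrwx" : "rw-rw-rw-";
    }

    // Analogue of setOwner/setPermission: succeeds without error,
    // but changes nothing.
    static void setPermission(String octal) {
        // deliberate no-op
    }

    public static void main(String[] args) {
        System.out.println(permissions(true));        // rwxrwxrwx
        System.out.println(permissions(false));       // rw-rw-rw-
        setPermission("644");                         // no error, no effect
        System.out.println(owner().equals(group()));  // true
    }
}
```

Code written against a real `FileSystem` sees exactly this shape of behaviour: metadata queries succeed but return fixed values, and permission-changing calls return normally without doing anything.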


@@ -39,7 +39,7 @@ higher performance.
The specifics of using these filesystems are documented below.
### Warning #1: Object Stores are not filesystems.
### Warning #1: Object Stores are not filesystems
Amazon S3 is an example of "an object store". In order to achieve scalability
and especially high availability, S3 has —as many other cloud object stores have
@@ -56,14 +56,38 @@ recursive file-by-file operations. They take time at least proportional to
the number of files, during which time partial updates may be visible. If
the operations are interrupted, the filesystem is left in an intermediate state.
### Warning #2: Because Object stores don't track modification times of directories,
features of Hadoop relying on this can have unexpected behaviour. E.g. the
### Warning #2: Object stores don't track modification times of directories
Features of Hadoop relying on this can have unexpected behaviour. E.g. the
AggregatedLogDeletionService of YARN will not remove the appropriate logfiles.
For further discussion on these topics, please consult
[The Hadoop FileSystem API Definition](../../../hadoop-project-dist/hadoop-common/filesystem/index.html).
### Warning #3: your AWS credentials are valuable
### Warning #3: Object stores have different authorization models
The object authorization model of S3 is quite different from the file
authorization model of HDFS and traditional file systems. It is not feasible to
persist file ownership and permissions in S3, so S3A reports stub information
from APIs that would query this metadata:
* File owner is reported as the current user.
* File group also is reported as the current user. Prior to Apache Hadoop
2.8.0, file group was reported as empty (no group associated), which is a
potential incompatibility problem for scripts that perform positional parsing of
shell output and other clients that expect to find a well-defined group.
* Directory permissions are reported as 777.
* File permissions are reported as 666.
S3A does not enforce any authorization checks on these stub permissions.
Users authenticate to an S3 bucket using AWS credentials. It's possible that
object ACLs have been defined to enforce authorization at the S3 side, but this
happens entirely within the S3 service, not within the S3A implementation.
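
The pre-2.8.0 empty-group behaviour is exactly the kind of thing that breaks positional parsers. The sketch below uses two invented listing lines, modelled on `hadoop fs -ls` output, to show the column shift:

```java
// Demonstrates the positional-parsing pitfall described above. The two
// listing lines are hypothetical, modelled on 'hadoop fs -ls' output.
public class GroupColumn {

    // Returns the i-th whitespace-separated field of a listing line.
    static String field(String line, int i) {
        return line.trim().split("\\s+")[i];
    }

    public static void main(String[] args) {
        // Pre-2.8.0 S3A: the group column is empty, so every later
        // field shifts one position to the left.
        String pre28  = "-rw-rw-rw-   1 alice        1024 2016-10-25 09:03 s3a://bucket/f";
        // 2.8.0 and later: group is reported as the current user.
        String post28 = "-rw-rw-rw-   1 alice alice  1024 2016-10-25 09:03 s3a://bucket/f";

        // A parser expecting the group in field 3 (0-based) gets the
        // file size instead on the old output:
        System.out.println(field(pre28, 3));   // 1024
        System.out.println(field(post28, 3));  // alice
    }
}
```

Clients that locate fields by position rather than by header should be tested against both output shapes when they may talk to older S3A versions.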
For further discussion on these topics, please consult
[The Hadoop FileSystem API Definition](../../../hadoop-project-dist/hadoop-common/filesystem/index.html).
### Warning #4: Your AWS credentials are valuable
Your AWS credentials not only pay for services, they offer read and write
access to the data. Anyone with the credentials can not only read your datasets
@@ -78,7 +102,7 @@ Do not inadvertently share these credentials through means such as
If you do any of these: change your credentials immediately!
### Warning #4: the S3 client provided by Amazon EMR are not from the Apache
### Warning #5: The S3 client provided by Amazon EMR is not from the Apache
Software Foundation, and is only supported by Amazon.
Specifically: on Amazon EMR, s3a is not supported, and Amazon recommends