diff --git a/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/introduction.md b/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/introduction.md
index 22da54cd7c1..194fa1515fc 100644
--- a/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/introduction.md
+++ b/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/introduction.md
@@ -373,6 +373,21 @@ a time proportional to the quantity of data to upload, and inversely proportiona
 to the network bandwidth. It may also fail —a failure that is better escalated
 than ignored.
 
+1. **Authorization**. Hadoop uses the `FileStatus` class to
+represent core metadata of files and directories, including the owner, group and
+permissions. Object stores might not have a viable way to persist this
+metadata, so they might need to populate `FileStatus` with stub values. Even if
+the object store persists this metadata, it still might not be feasible for the
+object store to enforce file authorization in the same way as a traditional file
+system. If the object store cannot persist this metadata, then the recommended
+convention is:
+    * File owner is reported as the current user.
+    * File group also is reported as the current user.
+    * Directory permissions are reported as 777.
+    * File permissions are reported as 666.
+    * File system APIs that set ownership and permissions execute successfully
+without error, but they are no-ops.
+
 Object stores with these characteristics, can not be used as a direct replacement
 for HDFS. In terms of this specification, their implementations of the
 specified operations do not match those required. They are considered supported
diff --git a/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md b/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md
index c0d9157fc10..0eb36ef799d 100644
--- a/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md
+++ b/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md
@@ -39,7 +39,7 @@ higher performance.
 
 The specifics of using these filesystems are documented below.
 
-### Warning #1: Object Stores are not filesystems.
+### Warning #1: Object Stores are not filesystems
 
 Amazon S3 is an example of "an object store". In order to achieve scalability
 and especially high availability, S3 has —as many other cloud object stores have
@@ -56,14 +56,38 @@ recursive file-by-file operations. They take time at least proportional to the
 number of files, during which time partial updates may be visible. If the
 operations are interrupted, the filesystem is left in an intermediate state.
 
-### Warning #2: Because Object stores don't track modification times of directories,
-features of Hadoop relying on this can have unexpected behaviour. E.g. the
+### Warning #2: Object stores don't track modification times of directories
+
+Features of Hadoop relying on this can have unexpected behaviour. E.g. the
 AggregatedLogDeletionService of YARN will not remove the appropriate logfiles.
 
 For further discussion on these topics, please consult
 [The Hadoop FileSystem API Definition](../../../hadoop-project-dist/hadoop-common/filesystem/index.html).
 
-### Warning #3: your AWS credentials are valuable
+### Warning #3: Object stores have different authorization models
+
+The object authorization model of S3 is much different from the file
+authorization model of HDFS and traditional file systems. It is not feasible to
+persist file ownership and permissions in S3, so S3A reports stub information
+from APIs that would query this metadata:
+
+* File owner is reported as the current user.
+* File group also is reported as the current user. Prior to Apache Hadoop
+2.8.0, file group was reported as empty (no group associated), which is a
+potential incompatibility problem for scripts that perform positional parsing of
+shell output and other clients that expect to find a well-defined group.
+* Directory permissions are reported as 777.
+* File permissions are reported as 666.
+
+S3A does not really enforce any authorization checks on these stub permissions.
+Users authenticate to an S3 bucket using AWS credentials. It's possible that
+object ACLs have been defined to enforce authorization at the S3 side, but this
+happens entirely within the S3 service, not within the S3A implementation.
+
+For further discussion on these topics, please consult
+[The Hadoop FileSystem API Definition](../../../hadoop-project-dist/hadoop-common/filesystem/index.html).
+
+### Warning #4: Your AWS credentials are valuable
 
 Your AWS credentials not only pay for services, they offer read and write
 access to the data. Anyone with the credentials can not only read your datasets
@@ -78,7 +102,7 @@ Do not inadvertently share these credentials through means such as
 
 If you do any of these: change your credentials immediately!
 
-### Warning #4: the S3 client provided by Amazon EMR are not from the Apache
+### Warning #5: The S3 client provided by Amazon EMR are not from the Apache
 Software foundation, and are only supported by Amazon.
 
 Specifically: on Amazon EMR, s3a is not supported, and amazon recommend
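The stub-metadata convention added to `introduction.md` above can be sketched in a few lines of Java. This is a hypothetical, self-contained illustration of the convention only; `StubFileStatusDemo` and its methods are not part of Hadoop's actual `FileStatus` API, which a real connector would implement instead:

```java
// Minimal sketch of the stub-metadata convention for object stores that
// cannot persist ownership or permissions. Names are illustrative only.
public class StubFileStatusDemo {

    // Owner falls back to the current user...
    static String owner() {
        return System.getProperty("user.name");
    }

    // ...and so does group (prior to Hadoop 2.8.0, S3A reported an empty group).
    static String group() {
        return owner();
    }

    // Directories report 777 (rwxrwxrwx); files report 666 (rw-rw-rw-).
    static String permissions(boolean isDirectory) {
        return isDirectory ? "rwxrwxrwx" : "rw-rw-rw-";
    }

    // Per the convention, setters succeed without error but are no-ops.
    static void setOwner(String newOwner, String newGroup) {
        // deliberately does nothing
    }

    public static void main(String[] args) {
        System.out.println("owner=" + owner() + " group=" + group());
        System.out.println("dir perms:  " + permissions(true));
        System.out.println("file perms: " + permissions(false));
    }
}
```

Note that because the setters are silent no-ops, tools such as `chown`/`chmod` equivalents appear to succeed against such a store while changing nothing, which is exactly why the diff documents the convention explicitly.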
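As background for Warnings #3 and #4: S3A takes its AWS credentials from Hadoop configuration, typically via the standard `fs.s3a.access.key` and `fs.s3a.secret.key` properties in `core-site.xml`. A minimal sketch with placeholder values (the real values are precisely what Warning #4 says must never be shared):

```xml
<!-- Sketch of a core-site.xml fragment; placeholder values only.
     Anyone holding these values has full read/write access to the bucket. -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```

Because these values grant access to the data itself, keep any file containing them out of version control, logs, and bug reports.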