Completely removes S3Guard support from the S3A codebase.
If the connector is configured to use any metastore other than
the null and local stores (i.e. DynamoDB is selected) the s3a client
will raise an exception and refuse to initialize.
This is to ensure that there is no mix of S3Guard enabled and disabled
deployments with the same configuration but different hadoop releases
-it must be turned off completely.
The "hadoop s3guard" command has been retained -but the supported
subcommands have been reduced to those which are not purely S3Guard
related: "bucket-info" and "uploads".
This is major change in terms of the number of files
changed; before cherry picking subsequent s3a patches into
older releases, this patch will probably need backporting
first.
Goodbye S3Guard, your work is done. Time to die.
Contributed by Steve Loughran.
This switches the default behavior of S3A output streams
to warning that Syncable.hsync() or hflush() have been
called; it's not considered an error unless the defaults
are overridden.
This avoids breaking applications which call the APIs,
at the risk of people trying to use S3 as a safe store
of streamed data (HBase WALs, audit logs etc).
Contributed by Steve Loughran.
Add support for S3 Access Points. This provides extra security as it
ensures applications are not working with buckets belong to third parties.
To bind a bucket to an access point, set the access point (ap) ARN,
which must be done for each specific bucket, using the pattern
fs.s3a.bucket.$BUCKET.accesspoint.arn = ARN
* The global/bucket option `fs.s3a.accesspoint.required` to
mandate that buckets must declare their access point.
* This is not compatible with S3Guard.
Consult the documentation for further details.
Contributed by Bogdan Stolojan
Addresses the problem of processes running out of memory when
there are many ABFS output streams queuing data to upload,
especially when the network upload bandwidth is less than the rate
data is generated.
ABFS Output streams now buffer their blocks of data to
"disk", "bytebuffer" or "array", as set in
"fs.azure.data.blocks.buffer"
When buffering via disk, the location for temporary storage
is set in "fs.azure.buffer.dir"
For safe scaling: use "disk" (default); for performance, when
confident that upload bandwidth will never be a bottleneck,
experiment with the memory options.
The number of blocks a single stream can have queued for uploading
is set in "fs.azure.block.upload.active.blocks".
The default value is 20.
Contributed by Mehakmeet Singh.
* HDFS-16129. Fixing the signature secret file misusage in HttpFS.
The signature secret file was not used in HttpFs.
- if the configuration did not contain the deprecated
httpfs.authentication.signature.secret.file option then it
used the random secret provider
- if both option (httpfs. and hadoop.http.) was set then
the HttpFSAuthenticationFilter could not read the file
because the file path was not substituted properly
!NOTE! behavioral change: the deprecated httpfs. configuration
values are overwritten with the hadoop.http. values.
The commit also contains a follow up change to the YARN-10814,
empty secret files will result in a random secret provider.
Co-authored-by: Tamas Domok <tdomok@cloudera.com>
This adds a new class org.apache.hadoop.util.Preconditions which is
* @Private/@Unstable
* Intended to allow us to move off Google Guava
* Is designed to be trivially backportable
(i.e contains no references to guava classes internally)
Please use this instead of the guava equivalents, where possible.
Contributed by: Ahmed Hussein
Change-Id: Ic392451bcfe7d446184b7c995734bcca8c07286e
This migrates the fs.s3a-server-side encryption configuration options
to a name which covers client-side encryption too.
fs.s3a.server-side-encryption-algorithm becomes fs.s3a.encryption.algorithm
fs.s3a.server-side-encryption.key becomes fs.s3a.encryption.key
The existing keys remain valid, simply deprecated and remapped
to the new values. If you want server-side encryption options
to be picked up regardless of hadoop versions, use
the old keys.
(the old key also works for CSE, though as no version of Hadoop
with CSE support has shipped without this remapping, it's less
relevant)
Contributed by: Mehakmeet Singh
This migrates the fs.s3a-server-side encryption configuration options
to a name which covers client-side encryption too.
fs.s3a.server-side-encryption-algorithm becomes fs.s3a.encryption.algorithm
fs.s3a.server-side-encryption.key becomes fs.s3a.encryption.key
The existing keys remain valid, simply deprecated and remapped
to the new values. If you want server-side encryption options
to be picked up regardless of hadoop versions, use
the old keys.
(the old key also works for CSE, though as no version of Hadoop
with CSE support has shipped without this remapping, it's less
relevant)
Contributed by: Mehakmeet Singh
* Router to support resolving monitored namenodes with DNS
* Style
* fix style and test failure
* Add test for NNHAServiceTarget const
* Resolve comments
* Fix test
* Comments and style
* Create a simple function to extract port
* Use LambdaTestUtils.intercept
* fix javadoc
* Trigger Build
* CredentialProviderFactory to detect and report on recursion.
* S3AFS to remove incompatible providers.
* Integration Test for this.
Contributed by Steve Loughran.
Currently, GzipCodec only supports BuiltInGzipDecompressor, if native zlib is not loaded. So, without Hadoop native codec installed, saving SequenceFile using GzipCodec will throw exception like "SequenceFile doesn't work with GzipCodec without native-hadoop code!"
Same as other codecs which we migrated to using prepared packages (lz4, snappy), it will be better if we support GzipCodec generally without Hadoop native codec installed. Similar to BuiltInGzipDecompressor, we can use Java Deflater to support BuiltInGzipCompressor.
Fixes the regression caused by HADOOP-17511 by moving where the
option fs.s3a.acl.default is read -doing it before the RequestFactory
is created.
Adds
* A unit test in TestRequestFactory to verify the ACLs are set
on all file write operations.
* A new ITestS3ACannedACLs test which verifies that ACLs really
do get all the way through.
* S3A Assumed Role delegation tokens to include the IAM permission
s3:PutObjectAcl in the generated role.
Contributed by Steve Loughran
This patch cuts down the size of directory trees used for
distcp contract tests against object stores, so making
them much faster against distant/slow stores.
On abfs, the test only runs with -Dscale (as was the case for s3a already),
and has the larger scale test timeout.
After every test case, the FileSystem IOStatistics are logged,
to provide information about what IO is taking place and
what it's performance is.
There are some test cases which upload files of 1+ MiB; you can
increase the size of the upload in the option
"scale.test.distcp.file.size.kb"
Set it to zero and the large file tests are skipped.
Contributed by Steve Loughran.
This work
* Defines the behavior of FileSystem.copyFromLocal in filesystem.md
* Implements a high performance implementation of copyFromLocalOperation
for S3
* Adds a contract test for the operation: AbstractContractCopyFromLocalTest
* Implements the contract tests for Local and S3A FileSystems
Contributed by: Bogdan Stolojan
The rest endpoint would be unusable with an empty secret file
(throwing IllegalArgumentExceptions).
Any IO error would have resulted in the same fallback path.
Co-authored-by: Tamas Domok <tdomok@cloudera.com>
This (big!) patch adds support for client side encryption in AWS S3,
with keys managed by AWS-KMS.
Read the documentation in encryption.md very, very carefully before
use and consider it unstable.
S3-CSE is enabled in the existing configuration option
"fs.s3a.server-side-encryption-algorithm":
fs.s3a.server-side-encryption-algorithm=CSE-KMS
fs.s3a.server-side-encryption.key=<KMS_KEY_ID>
You cannot enable CSE and SSE in the same client, although
you can still enable a default SSE option in the S3 console.
* Filesystem list/get status operations subtract 16 bytes from the length
of all files >= 16 bytes long to compensate for the padding which CSE
adds.
* The SDK always warns about the specific algorithm chosen being
deprecated. It is critical to use this algorithm for ranged
GET requests to work (i.e. random IO). Ignore.
* Unencrypted files CANNOT BE READ.
The entire bucket SHOULD be encrypted with S3-CSE.
* Uploading files may be a bit slower as blocks are now
written sequentially.
* The Multipart Upload API is disabled when S3-CSE is active.
Contributed by Mehakmeet Singh
* Rebase trunk
* Fix to use FQDN and update config name
* Fix javac
* Style and trigger build
* Trigger Build after force push
* Trigger Build
* Fix config names
The S3A connector supports
"an auditor", a plugin which is invoked
at the start of every filesystem API call,
and whose issued "audit span" provides a context
for all REST operations against the S3 object store.
The standard auditor sets the HTTP Referrer header
on the requests with information about the API call,
such as process ID, operation name, path,
and even job ID.
If the S3 bucket is configured to log requests, this
information will be preserved there and so can be used
to analyze and troubleshoot storage IO.
Contributed by Steve Loughran.
When the S3A and ABFS filesystems are closed,
their IOStatistics are logged at debug in the log:
org.apache.hadoop.fs.statistics.IOStatisticsLogging
Set `fs.iostatistics.logging.level` to `info` for the statistics
to be logged at info. (also: `warn` or `error` for even higher
log levels).
Contributed by: Mehakmeet Singh