1548 Commits

Author SHA1 Message Date
Steve Loughran a1c0673526
HADOOP-18198. Preparing for 3.3.3 release
Change-Id: Idebf79191dc91dad52073f2c63ee9ab3a99464d9
2022-04-14 17:19:16 +01:00
Ayush Saxena f9cccfa7ea
HADOOP-18096. Distcp: Sync moves filtered file to home directory rather than deleting. (#3940). Contributed by Ayush Saxena.
Reviewed-by: Steve Loughran <stevel@apache.org>
Reviewed-by: stack <stack@apache.org>
2022-02-11 02:10:02 +05:30
Petre Bogdan Stolojan 8cd8e435fb
HADOOP-17198. Support S3 Access Points (#3260) (branch-3.3.2) (#3955)
Add support for S3 Access Points. This provides extra security as it
ensures applications are not working with buckets belonging to third parties.

To bind a bucket to an access point, set the access point (ap) ARN,
which must be done for each specific bucket, using the pattern

fs.s3a.bucket.$BUCKET.accesspoint.arn = ARN

* Use the global/bucket option `fs.s3a.accesspoint.required` to
mandate that buckets must declare their access point.
* This is not compatible with S3Guard.

Consult the documentation for further details.
2022-02-04 10:09:00 -08:00
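The per-bucket pattern from HADOOP-17198 above can be illustrated with a minimal sketch. The class name, the helper method, the bucket name, and the ARN value are all hypothetical; only the key pattern `fs.s3a.bucket.$BUCKET.accesspoint.arn` comes from the commit message, and a plain `java.util.Properties` stands in for a Hadoop configuration object.

```java
import java.util.Properties;

/** Illustrative only: builds the per-bucket access point key named in the commit message. */
final class AccessPointKeySketch {

  /** Returns the per-bucket property name that holds the access point ARN. */
  static String accessPointKey(String bucket) {
    return "fs.s3a.bucket." + bucket + ".accesspoint.arn";
  }

  public static void main(String[] args) {
    // Assumption: Properties stands in for a Hadoop configuration object.
    Properties conf = new Properties();
    // Hypothetical bucket name and ARN, for illustration only.
    conf.setProperty(accessPointKey("example-bucket"),
        "arn:aws:s3:us-east-1:123456789012:accesspoint/example-ap");
    conf.forEach((key, value) -> System.out.println(key + " = " + value));
  }
}
```

Because the ARN is bound per bucket, each bucket that uses an access point needs its own entry.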
Steve Loughran 35e352a88b
HADOOP-18094. Disable S3A auditing by default.
See HADOOP-18091. S3A auditing leaks memory through ThreadLocal references

* Adds a new option fs.s3a.audit.enabled to control whether or not auditing
is enabled. This is false by default.

* When false, the S3A auditing manager is NoopAuditManagerS3A,
which was formerly only used for unit tests and
during filesystem initialization.

* When true, ActiveAuditManagerS3A is used for managing auditing,
allowing auditing events to be reported.

* updates documentation and tests.

This patch does not fix the underlying leak. When auditing is enabled,
long-lived threads will retain references to the audit managers
of S3A filesystem instances which have already been closed.

Contributed by Steve Loughran.

Change-Id: I671e594cd59e8ca77a1f65be791ad0ae9530b8d9
2022-01-24 15:06:01 +00:00
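The switch described in HADOOP-18094 above can be pictured with the following sketch. It is not the S3A code: the `AuditManager` interface and the factory method are hypothetical, and a `Map` stands in for the configuration. Only the option name, its default of false, and the two manager class names come from the commit message.

```java
import java.util.Map;

/** Illustrative only: picks an audit manager from a boolean option that defaults to false. */
final class AuditSwitchSketch {

  /** Hypothetical minimal interface; the real audit manager API is much larger. */
  interface AuditManager {
    String name();
  }

  static AuditManager create(Map<String, String> conf) {
    boolean enabled =
        Boolean.parseBoolean(conf.getOrDefault("fs.s3a.audit.enabled", "false"));
    // The returned objects only carry the name of the class the commit message mentions.
    return enabled ? () -> "ActiveAuditManagerS3A" : () -> "NoopAuditManagerS3A";
  }

  public static void main(String[] args) {
    System.out.println(create(Map.of()).name());
    System.out.println(create(Map.of("fs.s3a.audit.enabled", "true")).name());
  }
}
```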
monthonk d101e4d7fa HADOOP-14334. S3 SSEC tests to downgrade when running against a mandatory encryption object store (#3870)
Contributed by Monthon Klongklaew

Change-Id: Ib275c9690bbc90170c6a442ded198fe006c20bc1
2022-01-15 09:25:32 -08:00
Ayush Saxena 46b1411189 HADOOP-18056. DistCp: Filter duplicates in the source paths. (#3825). Contributed by Ayush Saxena.
Reviewed-by: tomscut <litao@bigo.sg>
Reviewed-by: Steve Loughran <stevel@apache.org>
2022-01-15 09:25:25 -08:00
Akira Ajisaka 84a767ea24 HADOOP-18045. Disable TestDynamometerInfra (#3829)
Reviewed-by: Fei Hui <feihui.ustc@gmail.com>
(cherry picked from commit dba139cd0f)
2022-01-04 14:38:24 -08:00
Anoop Sam John de27fa0097 HADOOP-17643 WASB : Make metadata checks case insensitive (#3103) 2022-01-04 14:37:12 -08:00
Akira Ajisaka c257abca21 HADOOP-18040. Use maven.test.failure.ignore instead of ignoreTestFailure (#3774)
Reviewed-by: Masatake Iwasaki <iwasakims@apache.org>
(cherry picked from commit 9b9e2ef87f)

 Conflicts:
	hadoop-tools/hadoop-federation-balance/pom.xml
2022-01-04 14:37:08 -08:00
Steve Loughran 19b99c1ecc HADOOP-17979. Add Interface EtagSource to allow FileStatus subclasses to provide etags (#3633)
Contributed by Steve Loughran

Change-Id: I596205d788f623114c12962941445432e2036c34
2022-01-04 14:36:10 -08:00
Steve Loughran 5314f2e7a3 HADOOP-18002. ABFS rename idempotency broken -remove recovery (#3641)
Cut modtime-based rename recovery, as the object modification time
is not updated during a rename operation.
Applications will have to use the etag API of HADOOP-17979
and implement it themselves.

Why not do the HEAD and etag recovery in the ABFS client?
It cuts the IO capacity in half, which kills job commit performance.
The manifest committer of MAPREDUCE-7341 will do this recovery
and act as the reference implementation of the algorithm.

Contributed by: Steve Loughran

Change-Id: I810054c9fd05041dac552f13d31fb15d7524721b
2022-01-04 14:27:22 -08:00
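HADOOP-18002 above leaves etag-based rename recovery to applications. One way such a check might look is sketched below, under stated assumptions: the toy in-memory store, the simulated lost response, and the rule "source gone and destination carries the etag recorded before the rename" are illustrative and are not taken from the ABFS client or from the manifest committer.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: etag-based check of whether a failed rename actually completed. */
final class EtagRenameRecoverySketch {

  /** Toy in-memory store (path -> etag) standing in for a remote object store. */
  static final class ToyStore {
    final Map<String, String> etags = new HashMap<>();
    boolean loseNextResponse; // simulate a rename that completes but reports failure

    String etag(String path) {
      return etags.get(path);
    }

    void rename(String src, String dst) {
      String tag = etags.remove(src);
      if (tag == null) {
        throw new IllegalStateException("source not found: " + src);
      }
      etags.put(dst, tag);
      if (loseNextResponse) {
        loseNextResponse = false;
        throw new IllegalStateException("simulated lost response");
      }
    }
  }

  /** Returns true if the rename succeeded or can be shown to have completed. */
  static boolean renameWithRecovery(ToyStore store, String src, String dst) {
    String sourceEtag = store.etag(src); // recorded before the rename
    try {
      store.rename(src, dst);
      return true;
    } catch (IllegalStateException e) {
      // Recovery: the source is gone and the destination carries the recorded etag.
      return sourceEtag != null
          && store.etag(src) == null
          && sourceEtag.equals(store.etag(dst));
    }
  }

  public static void main(String[] args) {
    ToyStore store = new ToyStore();
    store.etags.put("/job/tmp/part-0000", "etag-1");
    store.loseNextResponse = true;
    System.out.println(renameWithRecovery(store, "/job/tmp/part-0000", "/job/part-0000"));
  }
}
```

The extra etag lookups happen only on the failure path in this sketch, so the common case costs no additional requests.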
GuoPhilipse dd9fd3b01d HADOOP-18026. Fix default value of Magic committer (#3723)
Contributed by guophilipse

Change-Id: If915623c76619dd3d3b3bdf989688fa13e56fec1
2022-01-04 14:21:00 -08:00
Chao Sun ce74635dd4 Preparing for 3.3.1 release 2021-11-16 16:07:17 -08:00
Steve Loughran 7b632dd22b Revert "HADOOP-17873. ABFS: Fix transient failures in ITestAbfsStreamStatistics and ITestAbfsRestOperationException (#3341)"
This reverts commit 0379aebafe.
2021-11-05 14:22:07 +00:00
sumangala-patki 689dd7bf17
HADOOP-17863. ABFS: Fix compiler deprecation warning in TextFileBasedIdentityHandler (#3332)
Closes #3332

Contributed by Sumangala Patki

Change-Id: I2abd33bd62bb734a431cccfc50a52bdeb2bf7db6
2021-11-05 12:55:45 +00:00
Jinhu Wu 0557da6820 HADOOP-17374. support listObjectV2 (#3587)
(cherry picked from commit a9c51ea57d)
2021-11-04 21:45:04 -07:00
sumangala-patki 0379aebafe
HADOOP-17873. ABFS: Fix transient failures in ITestAbfsStreamStatistics and ITestAbfsRestOperationException (#3341)
Addresses transient failures in the following test classes:

* ITestAbfsStreamStatistics: Uses a filesystem level static instance to record read/write statistics, which also tracks these operations in other tests running in parallel. Marked for sequential-only run to avoid transient failure

* ITestAbfsRestOperationException: The use of a static member to track retry count causes transient failures when two tests of this class happen to run together. Switch to non-static variable for assertions on retry count

closes #3341

Contributed by Sumangala Patki

Change-Id: Ied4dec35c81e94efe5f999acae4bb8fde278202e
2021-11-04 15:57:42 +00:00
Steve Loughran a68671eaf7
HADOOP-17928. Syncable: S3A to warn and downgrade (#3585)
This switches the default behavior of S3A output streams
to logging a warning when Syncable.hsync() or hflush() is
called; it's not considered an error unless the defaults
are overridden.

This avoids breaking applications which call the APIs,
at the risk of people trying to use S3 as a safe store
of streamed data (HBase WALs, audit logs etc).

Contributed by Steve Loughran.

Change-Id: I0a02ec1e622343619f147f94158c18928a73a885
2021-11-04 14:41:42 +00:00
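The "warn and downgrade" behavior of HADOOP-17928 above can be sketched as follows. This is a toy stream, not the S3A output stream: the class, the constructor flag, and the warning counter are hypothetical, and the choice of `UnsupportedOperationException` for the strict mode is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;

/** Illustrative only: a stream whose hsync()/hflush() either warn or fail, depending on a flag. */
final class SyncableDowngradeSketch extends OutputStream {
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private final boolean downgrade; // hypothetical flag: true means warn instead of failing
  private int warnings;

  SyncableDowngradeSketch(boolean downgrade) {
    this.downgrade = downgrade;
  }

  @Override
  public void write(int b) {
    buffer.write(b);
  }

  public void hsync() {
    syncUnsupported("hsync");
  }

  public void hflush() {
    syncUnsupported("hflush");
  }

  private void syncUnsupported(String operation) {
    if (!downgrade) {
      throw new UnsupportedOperationException(operation + " is not supported");
    }
    warnings++;
    System.err.println("WARN: " + operation + "() called; data is not persisted until close()");
  }

  int warnings() {
    return warnings;
  }

  public static void main(String[] args) {
    SyncableDowngradeSketch out = new SyncableDowngradeSketch(true);
    out.write('x');
    out.hsync();
    System.out.println("warnings = " + out.warnings());
  }
}
```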
Anoop Sam John 913d06ad4d
HADOOP-17770 WASB : Support disabling buffered reads in positional reads (#3233) 2021-10-22 11:45:42 +05:30
Tamas Domok e7785bb7e5
HADOOP-17974. Import statements in hadoop-aws trigger clover failures. #3572
Contributed by Tamas Domok

Change-Id: I47da62596ce23d71709c65eb493bf656967d4415
2021-10-21 18:43:54 +01:00
Mehakmeet Singh bd077c3814
HADOOP-17953. S3A: Tests to lookup global or per-bucket configuration for encryption algorithm (#3525)
Followup to S3-CSE work of HADOOP-13887

Contributed by Mehakmeet Singh
2021-10-21 12:03:50 +01:00
adol001 21bd015df2
HADOOP-17932. Distcp file length comparison have no effect (#3519)
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
(cherry picked from commit 280ae1c0a9)
2021-10-18 19:09:09 +09:00
Viraj Jasani 77ee5a4266
HADOOP-17950. Provide replacement for deprecated APIs of commons-io IOUtils (#3515)
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
(cherry picked from commit 8071dbb9c6)
2021-10-07 11:00:19 +09:00
Steve Loughran 6f7b45641a
HADOOP-17922. move to fs.s3a.encryption.algorithm - JCEKS integration (#3466)
The ordering of the resolution of new and deprecated s3a encryption options
& secrets is the same when JCEKS and other hadoop credentials stores are used
to store them as when they are in XML files: per-bucket settings always take
priority over global values, even when the bucket-level options use the
old option names.

Contributed by Mehakmeet Singh and Steve Loughran

Change-Id: I871672071efa2eb6b600cb2658fceeef57f658a3
2021-10-05 11:39:43 +01:00
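The priority rule in HADOOP-17922 above (per-bucket settings win over global values, even when the bucket-level option uses the old name) can be sketched as an ordered lookup. The class and method are hypothetical and a `Map` stands in for the resolved configuration. Checking the new name before the old one within each level is an assumption of this sketch, and the per-bucket keys assume the usual `fs.s3a.bucket.$BUCKET.` prefix.

```java
import java.util.Map;

/** Illustrative only: bucket-level options beat global ones, under either option name. */
final class EncryptionOptionLookupSketch {

  static String resolveAlgorithm(Map<String, String> conf, String bucket) {
    String bucketPrefix = "fs.s3a.bucket." + bucket + ".";
    String[] candidates = {
        bucketPrefix + "encryption.algorithm",             // per-bucket, new name
        bucketPrefix + "server-side-encryption-algorithm", // per-bucket, old name
        "fs.s3a.encryption.algorithm",                     // global, new name
        "fs.s3a.server-side-encryption-algorithm",         // global, old name
    };
    for (String key : candidates) {
      String value = conf.get(key);
      if (value != null) {
        return value;
      }
    }
    return null; // nothing configured
  }

  public static void main(String[] args) {
    // Hypothetical settings: a global value under the new name and a
    // bucket-level value under the old name.
    Map<String, String> conf = Map.of(
        "fs.s3a.encryption.algorithm", "AES256",
        "fs.s3a.bucket.example.server-side-encryption-algorithm", "SSE-KMS");
    System.out.println(resolveAlgorithm(conf, "example"));
    System.out.println(resolveAlgorithm(conf, "other"));
  }
}
```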
Mehakmeet Singh 769059c2f5
HADOOP-17871. S3A CSE: minor tuning (#3412)
This migrates the S3A server-side encryption configuration options
to a name which covers client-side encryption too.

fs.s3a.server-side-encryption-algorithm becomes fs.s3a.encryption.algorithm
fs.s3a.server-side-encryption.key becomes fs.s3a.encryption.key

The existing keys remain valid, simply deprecated and remapped
to the new values. If you want server-side encryption options
to be picked up regardless of hadoop versions, use
the old keys.

(the old key also works for CSE, though as no version of Hadoop
with CSE support has shipped without this remapping, it's less
relevant)

Contributed by: Mehakmeet Singh

Change-Id: I51804b21b287dbce18864f0a6ad17126aba2b281
2021-10-05 11:39:25 +01:00
Mehakmeet Singh abb367aec6
HADOOP-17817.HADOOP-17823. S3A to raise IOE if both S3-CSE and S3Guard enabled (#3239)
S3A S3Guard tests to skip if S3-CSE is enabled (#3263)

    Follow on to
    * HADOOP-13887. Encrypt S3A data client-side with AWS SDK (S3-CSE)

    If the S3A bucket is set up to use S3-CSE encryption, all tests which turn
    on S3Guard are skipped, so they don't raise any exceptions about
    incompatible configurations.

Contributed by Mehakmeet Singh

Change-Id: I9f4188109b56a1f4e5a31fae265d980c5795db1e
2021-10-05 11:38:57 +01:00
Mehakmeet Singh aee975a136
HADOOP-13887. Support S3 client side encryption (S3-CSE) using AWS-SDK (#2706)
This (big!) patch adds support for client side encryption in AWS S3,
with keys managed by AWS-KMS.

Read the documentation in encryption.md very, very carefully before
use and consider it unstable.

S3-CSE is enabled in the existing configuration option
"fs.s3a.server-side-encryption-algorithm":

fs.s3a.server-side-encryption-algorithm=CSE-KMS
fs.s3a.server-side-encryption.key=<KMS_KEY_ID>

You cannot enable CSE and SSE in the same client, although
you can still enable a default SSE option in the S3 console.

* Filesystem list/get status operations subtract 16 bytes from the length
  of all files >= 16 bytes long to compensate for the padding which CSE
  adds.
* The SDK always warns about the specific algorithm chosen being
  deprecated. It is critical to use this algorithm for ranged
  GET requests to work (i.e. random IO). Ignore the warning.
* Unencrypted files CANNOT BE READ.
  The entire bucket SHOULD be encrypted with S3-CSE.
* Uploading files may be a bit slower as blocks are now
  written sequentially.
* The Multipart Upload API is disabled when S3-CSE is active.

Contributed by Mehakmeet Singh

Change-Id: Ie1a27a036a39db66a67e9c6d33bc78d54ea708a0
2021-10-05 11:37:41 +01:00
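The length adjustment listed in HADOOP-13887 above (subtract 16 bytes from every file that is at least 16 bytes long) amounts to the arithmetic below. The class, constant, and method names are hypothetical; only the 16-byte figure and the ">= 16 bytes" condition come from the commit message.

```java
/** Illustrative only: length reported for a client-side encrypted object. */
final class CseLengthSketch {

  /** Padding which CSE adds, per the commit message. */
  static final long CSE_PADDING_LENGTH = 16;

  /** Returns the length to report for an object stored with the given length. */
  static long unpaddedLength(long storedLength) {
    return storedLength >= CSE_PADDING_LENGTH
        ? storedLength - CSE_PADDING_LENGTH
        : storedLength; // shorter objects are reported unchanged
  }

  public static void main(String[] args) {
    for (long stored : new long[] {0, 15, 16, 1040}) {
      System.out.println(stored + " -> " + unpaddedLength(stored));
    }
  }
}
```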
Josh Elser feeaebeb84
HADOOP-17934. ABFS: Make sure the AbfsHttpOperation is non-null before using it (#3477)
Contributed by: Josh Elser

Change-Id: I24a2e0322d8cae2d72d65c7f3d8a74580a418317
2021-10-04 20:54:39 +01:00
Mehakmeet Singh 8e5620cd9e
HADOOP-17195. ABFS: OutOfMemory error while uploading huge files (#3446)
Addresses the problem of processes running out of memory when
there are many ABFS output streams queuing data to upload,
especially when the network upload bandwidth is less than the rate
data is generated.

ABFS Output streams now buffer their blocks of data to
"disk", "bytebuffer" or "array", as set in
"fs.azure.data.blocks.buffer"

When buffering via disk, the location for temporary storage
is set in "fs.azure.buffer.dir"

For safe scaling: use "disk" (default); for performance, when
confident that upload bandwidth will never be a bottleneck,
experiment with the memory options.

The number of blocks a single stream can have queued for uploading
is set in "fs.azure.block.upload.active.blocks".
The default value is 20.

Contributed by Mehakmeet Singh.
2021-09-22 11:19:16 +01:00
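The cap on queued blocks in HADOOP-17195 above can be pictured as a counting semaphore. This sketch is not the ABFS implementation: the class and its methods are hypothetical, and it rejects a block when the limit is reached, whereas a real stream would more plausibly block the writer. Only the idea of a per-stream limit and the default of 20 come from the commit message.

```java
import java.util.concurrent.Semaphore;

/** Illustrative only: bounds the number of blocks one stream may have queued for upload. */
final class ActiveBlockLimitSketch {
  private final Semaphore permits;

  ActiveBlockLimitSketch(int activeBlocks) {
    this.permits = new Semaphore(activeBlocks);
  }

  /** Tries to queue a block; returns false when the limit is already reached. */
  boolean tryQueueBlock() {
    return permits.tryAcquire();
  }

  /** Called when an upload finishes, freeing a slot for the next block. */
  void blockUploaded() {
    permits.release();
  }

  public static void main(String[] args) {
    ActiveBlockLimitSketch limit = new ActiveBlockLimitSketch(20); // default from the commit message
    int queued = 0;
    while (limit.tryQueueBlock()) {
      queued++;
    }
    System.out.println("queued before hitting the limit: " + queued);
  }
}
```

Bounding the queue is what keeps memory use flat when data is generated faster than it can be uploaded.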
sumangala-patki dd30db78e7
HADOOP-17290. ABFS: Add Identifiers to Client Request Header (#2520)
Contributed by Sumangala Patki.

(cherry picked from commit 35570e414a)
2021-09-21 16:45:51 +01:00
sumangala-patki 1cb9e747eb
HADOOP-17618. ABFS: Partially obfuscate SAS object IDs in Logs (#2845)
Contributed by Sumangala Patki

(cherry picked from commit 3450522c2f)
2021-09-09 14:04:12 +01:00
Steve Loughran a2242df10a
HADOOP-17894. CredentialProviderFactory.getProviders() recursion loading JCEKS file from S3A (#3393)
* CredentialProviderFactory to detect and report on recursion.
* S3AFS to remove incompatible providers.
* Integration Test for this.

Contributed by Steve Loughran.

Change-Id: Ia247b3c9fe8488ffdb7f57b40eb6e37c57e522ef
2021-09-08 17:00:20 +01:00
Mukund Thakur 3b1c594355 HADOOP-17156. ABFS: Release the byte buffers held by input streams in close() (#3285)
Contributed By: Mukund Thakur
2021-09-07 15:29:22 +05:30
Dongjoon Hyun 8606b2cddd
HADOOP-17869. `fs.s3a.connection.maximum` should be bigger than `fs.s3a.threads.max` (#3337).
The value of `fs.s3a.connection.maximum` has been increased to 96

Contributed by Dongjoon Hyun

Change-Id: I9020a2bfd2a67fa7a2ec0598ed9d63e78ee99c73
2021-08-30 18:31:57 +01:00
Steve Loughran c1ad91e72d
HADOOP-17822. fs.s3a.acl.default not working after S3A Audit feature (#3249)
Fixes the regression caused by HADOOP-17511 by moving where the
option fs.s3a.acl.default is read: it is now read before the RequestFactory
is created.

Adds

* A unit test in TestRequestFactory to verify the ACLs are set
  on all file write operations.
* A new ITestS3ACannedACLs test which verifies that ACLs really
  do get all the way through.
* S3A Assumed Role delegation tokens to include the IAM permission
  s3:PutObjectAcl in the generated role.

Contributed by Steve Loughran

Change-Id: I3abac6a1b9e150b6b6df0af7c2c70093f8f518cb
2021-08-02 15:33:34 +01:00
Steve Loughran 26514b6534 HADOOP-17628. Distcp contract test is really slow with ABFS and S3A; timing out. (#3240)
This patch cuts down the size of directory trees used for
distcp contract tests against object stores, so making
them much faster against distant/slow stores.

On abfs, the test only runs with -Dscale (as was the case for s3a already),
and has the larger scale test timeout.

After every test case, the FileSystem IOStatistics are logged,
to provide information about what IO is taking place and
what its performance is.

There are some test cases which upload files of 1+ MiB; you can
increase the size of the upload with the option
"scale.test.distcp.file.size.kb".
Set it to zero and the large file tests are skipped.

Contributed by Steve Loughran.
2021-08-02 12:58:37 +01:00
Bobby Wang 904cdd0b00
HADOOP-17812. NPE in S3AInputStream read() after failure to reconnect to store (#3222)
This improves error handling after multiple failures reading data
-when the read fails and attempts to reconnect() also fail.

Contributed by Bobby Wang.

Change-Id: If17dee395ad6b9b7c738021bad20d0a13eb4011e
2021-08-02 12:58:25 +01:00
Petre Bogdan Stolojan f2cec5cb88
HADOOP-17139 Re-enable optimized copyFromLocal implementation in S3AFileSystem (#3101)
This work
* Defines the behavior of FileSystem.copyFromLocal in filesystem.md
* Provides a high-performance implementation of copyFromLocalOperation
  for S3
* Adds a contract test for the operation: AbstractContractCopyFromLocalTest
* Implements the contract tests for Local and S3A FileSystems

Contributed by: Bogdan Stolojan

Change-Id: I25d502102775c3626c4264e5a14c649879730050
2021-08-02 11:58:36 +01:00
Brian Loss 37e0828e76
HADOOP-17811: ABFS ExponentialRetryPolicy doesn't pick up configuration values (#3221)
Contributed by Brian Loss.

Change-Id: I5f24196d1d02de91336c3679abaf8d55cfaed746
2021-08-02 11:37:33 +01:00
bshashikant 18bd66e5b0 HDFS-16145. CopyListing fails with FNF exception with snapshot diff. (#3234)
(cherry picked from commit dac10fcc20)
2021-07-28 09:38:06 +01:00
Petre Bogdan Stolojan e89d30b6b7
HADOOP-17458. S3A to treat "SdkClientException: Data read has a different length than the expected" as EOFException (#3040)
Some network exceptions can raise SdkClientException with message
`Data read has a different length than the expected`.

These should be recoverable.

Contributed by Bogdan Stolojan

Change-Id: Ia22fd77d90971e9e02b4f947398a4749eebe5909
2021-07-23 14:46:59 +01:00
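The translation described in HADOOP-17458 above can be sketched as a message match that maps the failure to `EOFException`, which retry logic can then treat as recoverable. `FakeSdkClientException` is a hypothetical stand-in for the AWS SDK class, and the helper is not the S3A translation code; only the message text comes from the commit.

```java
import java.io.EOFException;
import java.io.IOException;

/** Illustrative only: maps a length-mismatch client exception to EOFException. */
final class LengthMismatchTranslationSketch {

  static final String EOF_MESSAGE = "Data read has a different length than the expected";

  /** Hypothetical stand-in for the AWS SDK's SdkClientException. */
  static final class FakeSdkClientException extends RuntimeException {
    FakeSdkClientException(String message) {
      super(message);
    }
  }

  static IOException translate(RuntimeException e) {
    String message = e.getMessage();
    if (message != null && message.contains(EOF_MESSAGE)) {
      EOFException eof = new EOFException(message);
      eof.initCause(e); // keep the original exception for diagnostics
      return eof;
    }
    return new IOException(e);
  }

  public static void main(String[] args) {
    IOException translated =
        translate(new FakeSdkClientException(EOF_MESSAGE + ": simulated failure"));
    System.out.println(translated.getClass().getSimpleName());
  }
}
```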
Mehakmeet Singh 14a3e74c5c
HADOOP-17801. No error message reported when bucket doesn't exist in S3AFS (#3202)
Contributed by: Mehakmeet Singh.

Change-Id: I26c2a85ef6bbfd1b8269a23fc44d9a55d7fa091c
2021-07-16 15:36:54 +01:00
Mehakmeet Singh cd15b0cb8a HADOOP-17803. Remove WARN logging from LoggingAuditor when executing a request outside an audit span (#3207)
Followup to HADOOP-17511. "Add audit/telemetry logging to S3A connector"

Contributed by Mehakmeet Singh
2021-07-16 11:52:37 +01:00
snehavarma 11825d30e8
HADOOP-17714 ABFS: testBlobBackCompatibility, testRandomRead & WasbAbfsCompatibility tests fail when triggered with default configs (#3035) (#3126)
(cherry picked from commit 35e4c31fff)
2021-07-12 11:53:46 +05:30
snehavarma ab3809cf8d
HADOOP-17715 ABFS: Append blob tests with non HNS accounts fail (#3028) (#3125)
(cherry picked from commit 4c039fafeb)
2021-07-12 11:51:41 +05:30
sumangala-patki aa6a9cac72
HADOOP-17596. ABFS: Change default Readahead Queue Depth from num(processors) to const (#3106)
* HADOOP-17596. ABFS: Change default Readahead Queue Depth from num(processors) to const (#2795).
Contributed by Sumangala Patki.

(cherry picked from commit 76d92eb2a2)
2021-07-10 15:09:59 +05:30
litao 7cb91db575 HDFS-16122. Fix DistCpContext#toString() (#3191). Contributed by tomscut.
Signed-off-by: Ayush Saxena <ayushsaxena@apache.org>
2021-07-10 13:56:36 +05:30
Mukund Thakur e8f9af6f2a
HADOOP-17250 Lot of short reads can be merged with readahead. (#3110)
Introduces the fs.azure.readahead.range parameter, which can be set by the user.
Data will be populated in the buffer for random reads as well, which leads to fewer
remote calls.

This patch also changes the seek implementation to perform a lazy seek. The
actual seek is done when a read is initiated and the data is not present in the buffer;
otherwise data is returned from the buffer, reducing the number of remote storage calls.

Contributed By: Mukund Thakur

Change-Id: Ib920eedd0087caa150afa4d4c23e89df56b29e83
2021-07-05 11:23:32 +01:00
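The lazy seek described in HADOOP-17250 above can be sketched with a toy stream. Everything here is hypothetical: a byte array stands in for the remote object, the `readahead` constructor argument plays the role of the configured readahead range, and a counter records simulated remote calls. It is not the ABFS input stream.

```java
import java.util.Arrays;

/** Illustrative only: seek() records the position; read() fetches only on a buffer miss. */
final class LazySeekSketch {
  private final byte[] remote;   // stands in for the remote object
  private final int readahead;   // hypothetical readahead range, in bytes
  private byte[] buffer = new byte[0];
  private long bufferStart;      // remote offset of buffer[0]
  private long pos;              // logical position, updated by seek()
  private int remoteCalls;

  LazySeekSketch(byte[] remote, int readahead) {
    this.remote = remote;
    this.readahead = readahead;
  }

  /** Lazy: only records the target position, no IO happens here. */
  void seek(long newPos) {
    if (newPos < 0) {
      throw new IllegalArgumentException("negative seek position");
    }
    pos = newPos;
  }

  /** Reads one byte, fetching a readahead-sized range only if pos is outside the buffer. */
  int read() {
    if (pos >= remote.length) {
      return -1;
    }
    boolean inBuffer = pos >= bufferStart && pos < bufferStart + buffer.length;
    if (!inBuffer) {
      int length = (int) Math.min(readahead, remote.length - pos);
      buffer = Arrays.copyOfRange(remote, (int) pos, (int) pos + length);
      bufferStart = pos;
      remoteCalls++; // one simulated remote request
    }
    return buffer[(int) (pos++ - bufferStart)] & 0xff;
  }

  int remoteCalls() {
    return remoteCalls;
  }

  public static void main(String[] args) {
    byte[] data = new byte[100];
    for (int i = 0; i < data.length; i++) {
      data[i] = (byte) i;
    }
    LazySeekSketch in = new LazySeekSketch(data, 10);
    in.seek(50);
    in.read();
    in.seek(55); // still inside the buffered range, so no new remote call
    in.read();
    System.out.println("remote calls: " + in.remoteCalls());
  }
}
```

Short reads that land inside the buffered range are served locally, which is how the readahead merges them into fewer remote calls.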
Mehakmeet Singh f1a14df9e6
HADOOP-17774. S3A bytesRead FS statistic showing twice the correct value (#3144)
Contributed by: Mehakmeet Singh

Change-Id: I3302654ca36474a5f399aa848f88bce4587022d8
2021-07-02 14:13:26 +01:00
Zamil Majdy 80859d714d
HADOOP-17764. S3AInputStream read does not re-open the input stream on the second read retry attempt (#3109)
Contributed by Zamil Majdy.

Change-Id: I680d9c425c920ff1a7cd4764d62e10e6ac78bee4
2021-06-25 20:47:11 +01:00