Commit Graph

1564 Commits

Author SHA1 Message Date
Daniel Carl Jones b749438a8c
HADOOP-14661. Add S3 requester pays bucket support to S3A (#3962)
Adds the option fs.s3a.requester.pays.enabled, which, if set to true, allows
the client to access S3 buckets where the requester is billed for the IO.

Contributed by Daniel Carl Jones

Change-Id: I51f64d0f9b3be3c4ec493bcf91927fca3b20407a
2022-03-23 20:13:20 +00:00
Steve Loughran ebdd8771c5
HADOOP-17851. S3A to support user-specified content encoding (#3498)
The option fs.s3a.object.content.encoding declares the content encoding to be set on files when they are written; this is served up in the "Content-Encoding" HTTP header when reading objects back in.

This is useful for people loading the data into other tools in the AWS ecosystem which don't use file extensions to infer compression type (e.g. serving compressed files from S3 or importing into RDS)

Contributed by: Holden Karau

Change-Id: Ice0da75b516370f51f79e45f391d46c5c7aa4ce4
2022-03-23 20:05:28 +00:00
Steve Loughran 105e0dbd92
HADOOP-13704. Optimized S3A getContentSummary()
Optimize the scan for s3 by performing a deep tree listing,
inferring directory counts from the paths returned.

Contributed by Ahmar Suhail.

Change-Id: I26ffa8c6f65fd11c68a88d6e2243b0eac6ffd024
2022-03-22 13:31:13 +00:00
Steve Loughran 3238bdab89
HADOOP-18163. hadoop-azure support for the Manifest Committer of MAPREDUCE-7341
Follow-on patch to MAPREDUCE-7341, adding ABFS support and tests

* resilient rename
* tests for job commit through the manifest committer.

contains
- HADOOP-17976. ABFS etag extraction inconsistent between LIST and HEAD calls
- HADOOP-16204. ABFS tests to include terasort

Contributed by Steve Loughran.

Change-Id: I0a7d4043bdf19bcb00c033fc389730109b93b77f
2022-03-17 11:47:15 +00:00
Mukund Thakur e0619b702a HADOOP-18112: Implement paging during multi object delete. (#4045)
Multi object delete of size more than 1000 is not supported by S3 and 
fails with MalformedXML error. So implementing paging of requests to 
reduce the number of keys in a single request. Page size can be configured
using "fs.s3a.bulk.delete.page.size" 

 Contributed By: Mukund Thakur
2022-03-11 13:16:51 +05:30
Mehakmeet Singh 909048d87d
HADOOP-18150. Fix ITestAuditManagerDisabled test in S3A. (#4044)
Contributed by Mehakmeet Singh

Change-Id: I25c10844e4ad64b1fd7af9a02018220a611c85e0
2022-03-03 18:46:28 +00:00
Steve Loughran 36a50ba3e0
HADOOP-18075. ABFS: Fix failure caused by listFiles() in ITestAbfsRestOperationException (#4040)
Contributed by Sumangala Patki

Change-Id: I245c08dab050d59b90ac6fdcb4c03153db77be0b
2022-03-01 13:48:39 +00:00
sumangala-patki 0ed0375413
HADOOP-17862. ABFS: Fix unchecked cast compiler warning for AbfsListStatusRemoteIterator (#3331)
closes #3331

Contributed by Sumangala Patki

Change-Id: I6cca91c8bcc34052c5233035f14a576f23086067
2022-03-01 13:48:39 +00:00
sumangala-patki 5e109705ef
HADOOP-17765. ABFS: Use Unique File Paths in Tests. (#3153)
Contributed by Sumangala Patki

Change-Id: Ic8f34bf578069504f7a811a7729982b9c9f49729
2022-03-01 12:29:03 +00:00
Sumangala Patki a1319e2404
HADOOP-18071. ABFS: Set driver global timeout for ITestAzureBlobFileSystemBasics (#3866)
Contributed by Sumangala Patki

Change-Id: I05f0cd1f0bd277b90f06a71345c46bfde48d7e7e
2022-02-23 21:30:39 +00:00
Ayush Saxena 5b47b9f360
HADOOP-18096. Distcp: Sync moves filtered file to home directory rather than deleting. (#3940). Contributed by Ayush Saxena.
Reviewed-by: Steve Loughran <stevel@apache.org>
Reviewed-by: stack <stack@apache.org>
2022-02-11 02:05:14 +05:30
Steve Loughran 088684ec60
HADOOP-18091. S3A auditing leaks memory through ThreadLocal references (#3930)
Adds a new map type WeakReferenceMap, which stores weak
references to values, and a WeakReferenceThreadMap subclass
to more closely resemble a thread local type, as it is a
map of threadId to value.

Construct it with a factory method and optional callback
for notification on loss and regeneration.

 WeakReferenceThreadMap<WrappingAuditSpan> activeSpan =
      new WeakReferenceThreadMap<>(
          (k) -> getUnbondedSpan(),
          this::noteSpanReferenceLost);

This is used in ActiveAuditManagerS3A for span tracking.

Relates to
* HADOOP-17511. Add an Audit plugin point for S3A
* HADOOP-18094. Disable S3A auditing by default.

Contributed by Steve Loughran.

Change-Id: Ibf7bb082fd47298f7ebf46d92f56e80ca9b2aaf8
2022-02-10 12:33:40 +00:00
Joey Krabacher 84de16028d
HADOOP-18114. Documentation correction in assumed_roles.md (#3949)
Fixes typo in hadoop-aws/assumed_roles.md

Contributed by Joey Krabacher

Change-Id: I2b77bd7793ae0433196b77042d5f400d0bcbe745
2022-02-09 10:47:24 +00:00
Petre Bogdan Stolojan 87ff57765a
HADOOP-18085. S3 SDK Upgrade causes AccessPoint ARN endpoint mistranslation (#3902)
Part of HADOOP-17198. Support S3 Access Points.

HADOOP-18068. "upgrade AWS SDK to 1.12.132" broke the access point endpoint
translation.

Correct endpoints should start with "s3-accesspoint.", after SDK upgrade they start with
"s3.accesspoint-" which messes up tests + region detection by the SDK.

Contributed by Bogdan Stolojan

Change-Id: I0c0181628ab803afc39036003777eaec79aa378c
2022-02-04 16:22:24 +00:00
Petre Bogdan Stolojan a8d7acf1a8
HADOOP-17951. Improve S3A checking of S3 Access Point existence (#3516)
Follow-on to HADOOP-17198. Support S3 Access Points

Contributed by Bogdan Stolojan

Change-Id: I0932476c64e1967eb0cb3e0f00060fac5d2bae72
2022-02-04 16:22:04 +00:00
Petre Bogdan Stolojan 664075f35d
HADOOP-17198. Support S3 Access Points (#3260)
Add support for S3 Access Points. This provides extra security as it
ensures applications are not working with buckets belong to third parties.

To bind a bucket to an access point, set the access point (ap) ARN,
which must be done for each specific bucket, using the pattern

fs.s3a.bucket.$BUCKET.accesspoint.arn = ARN

* The global/bucket option `fs.s3a.accesspoint.required` to
mandate that buckets must declare their access point.
* This is not compatible with S3Guard.

Consult the documentation for further details.

Contributed by Bogdan Stolojan

(this commit contains the changes to TestArnResource from HADOOP-18068,
 "upgrade AWS SDK to 1.12.132" so that it works with the later SDK.)

Change-Id: I3fac213e52ca6ec1c813effb8496c353964b8e1b
2022-02-04 16:21:35 +00:00
Steve Loughran 4fd0389153
HADOOP-18094. Disable S3A auditing by default.
See HADOOP-18091. S3A auditing leaks memory through ThreadLocal references

* Adds a new option fs.s3a.audit.enabled to controls whether or not auditing
is enabled. This is false by default.

* When false, the S3A auditing manager is NoopAuditManagerS3A,
which was formerly only used for unit tests and
during filsystem initialization.

* When true, ActiveAuditManagerS3A is used for managing auditing,
allowing auditing events to be reported.

* updates documentation and tests.

This patch does not fix the underlying leak. When auditing is enabled,
long-lived threads will retain references to the audit managers
of S3A filesystem instances which have already been closed.

Contributed by Steve Loughran.

Change-Id: I671e594cd59e8ca77a1f65be791ad0ae9530b8d9
2022-01-24 14:04:23 +00:00
Anmol Asrani 9b221b9599
HADOOP-18084. ABFS: Add testfilePath while verifying test contents are read correctly (#3903)
Contributed by: Anmol Asrani

Change-Id: I6e71bf349a74032f453398c7ae66f9c3305be190
2022-01-19 10:18:05 +00:00
Steve Loughran 8ccc586af6
HADOOP-17409. Remove s3guard from S3A module (#3534)
Completely removes S3Guard support from the S3A codebase.

If the connector is configured to use any metastore other than
the null and local stores (i.e. DynamoDB is selected) the s3a client
will raise an exception and refuse to initialize.

This is to ensure that there is no mix of S3Guard enabled and disabled
deployments with the same configuration but different hadoop releases
-it must be turned off completely.

The "hadoop s3guard" command has been retained -but the supported
subcommands have been reduced to those which are not purely S3Guard
related: "bucket-info" and "uploads".

This is major change in terms of the number of files
changed; before cherry picking subsequent s3a patches into
older releases, this patch will probably need backporting
first.

Goodbye S3Guard, your work is done. Time to die.

Contributed by Steve Loughran.
2022-01-18 18:04:48 +00:00
Steve Loughran 47ba977ca9
HADOOP-18068. upgrade AWS SDK to 1.12.132 (#3864)
With this update, the versions of key shaded dependencies are

  jackson    2.12.3
  httpclient 4.5.13

This backport patch does not include the TestArn changes needed
for the test to work with this version of the SDK; it is only
to be applied to branches without HADOOP-17198. "Support S3 Access Points".
If that patch is backported later, that test suite MUST be
updated to the latest version.

Contributed by Steve Loughran

Change-Id: I8d2b71781ee8472b16469531f9cd0de32dd3356f
2022-01-18 12:20:12 +00:00
monthonk 7dd8e900f8
HADOOP-14334. S3 SSEC tests to downgrade when running against a mandatory encryption object store (#3870)
Contributed by Monthon Klongklaew

Change-Id: Ib275c9690bbc90170c6a442ded198fe006c20bc1
2022-01-09 18:06:27 +00:00
Ayush Saxena 5edb33b5ed
HADOOP-18056. DistCp: Filter duplicates in the source paths. (#3825). Contributed by Ayush Saxena.
Reviewed-by: tomscut <litao@bigo.sg>
Reviewed-by: Steve Loughran <stevel@apache.org>
2022-01-05 23:53:55 +05:30
Akira Ajisaka 7cd52000e0 HADOOP-18045. Disable TestDynamometerInfra (#3829)
Reviewed-by: Fei Hui <feihui.ustc@gmail.com>
(cherry picked from commit dba139cd0f)
2021-12-28 13:23:05 +09:00
Anoop Sam John 9a1c8d2f41
HADOOP-17643 WASB : Make metadata checks case insensitive (#3103) 2021-12-10 10:44:31 +05:30
Akira Ajisaka 35c5c6bb83 HADOOP-18040. Use maven.test.failure.ignore instead of ignoreTestFailure (#3774)
Reviewed-by: Masatake Iwasaki <iwasakims@apache.org>
(cherry picked from commit 9b9e2ef87f)

 Conflicts:
	hadoop-tools/hadoop-federation-balance/pom.xml
2021-12-10 01:38:26 +09:00
GuoPhilipse 34c415ca2e
HADOOP-18026. Fix default value of Magic committer (#3723)
Contributed by guophilipse

Change-Id: If915623c76619dd3d3b3bdf989688fa13e56fec1
2021-12-03 16:53:44 +00:00
Steve Loughran 67eaf5aa9f
HADOOP-17979. Add Interface EtagSource to allow FileStatus subclasses to provide etags (#3633)
Contributed by Steve Loughran

Change-Id: I596205d788f623114c12962941445432e2036c34
2021-11-29 16:20:55 +00:00
Steve Loughran e1267608ec
HADOOP-18002. ABFS rename idempotency broken -remove recovery (#3641)
Cut modtime-based rename recovery as object modification time
is not updated during rename operation.
Applications will have to use etag API of HADOOP-17979
and implement it themselves.

Why not do the HEAD and etag recovery in ABFS client?
Cuts the IO capacity in half so kills job commit performance.
The manifest committer of MAPREDUCE-7341 will do this recovery
and act as the reference implementation of the algorithm.

Contributed by: Steve Loughran

Change-Id: I810054c9fd05041dac552f13d31fb15d7524721b
2021-11-17 11:53:34 +00:00
Chao Sun e079fa6577 Preparing for 3.3.3 development 2021-11-16 16:02:34 -08:00
Steve Loughran 7b632dd22b Revert "HADOOP-17873. ABFS: Fix transient failures in ITestAbfsStreamStatistics and ITestAbfsRestOperationException (#3341)"
This reverts commit 0379aebafe.
2021-11-05 14:22:07 +00:00
sumangala-patki 689dd7bf17
HADOOP-17863. ABFS: Fix compiler deprecation warning in TextFileBasedIdentityHandler (#3332)
Closes #3332

Contributed by Sumangala Patki

Change-Id: I2abd33bd62bb734a431cccfc50a52bdeb2bf7db6
2021-11-05 12:55:45 +00:00
Jinhu Wu 0557da6820 HADOOP-17374. support listObjectV2 (#3587)
(cherry picked from commit a9c51ea57d)
2021-11-04 21:45:04 -07:00
sumangala-patki 0379aebafe
HADOOP-17873. ABFS: Fix transient failures in ITestAbfsStreamStatistics and ITestAbfsRestOperationException (#3341)
Addresses transient failures in the following test classes:

* ITestAbfsStreamStatistics: Uses a filesystem level static instance to record read/write statistics, which also tracks these operations in other tests running in parallel. Marked for sequential-only run to avoid transient failure

* ITestAbfsRestOperationException: The use of a static member to track retry count causes transient failures when two tests of this class happen to run together. Switch to non-static variable for assertions on retry count

closes #3341

Contributed by Sumangala Patki

Change-Id: Ied4dec35c81e94efe5f999acae4bb8fde278202e
2021-11-04 15:57:42 +00:00
Steve Loughran a68671eaf7
HADOOP-17928. Syncable: S3A to warn and downgrade (#3585)
This switches the default behavior of S3A output streams
to warning that Syncable.hsync() or hflush() have been
called; it's not considered an error unless the defaults
are overridden.

This avoids breaking applications which call the APIs,
at the risk of people trying to use S3 as a safe store
of streamed data (HBase WALs, audit logs etc).

Contributed by Steve Loughran.

Change-Id: I0a02ec1e622343619f147f94158c18928a73a885
2021-11-04 14:41:42 +00:00
Anoop Sam John 913d06ad4d
HADOOP-17770 WASB : Support disabling buffered reads in positional reads (#3233) 2021-10-22 11:45:42 +05:30
Tamas Domok e7785bb7e5
HADOOP-17974. Import statements in hadoop-aws trigger clover failures. #3572
Contributed by Tamas Domok

Change-Id: I47da62596ce23d71709c65eb493bf656967d4415
2021-10-21 18:43:54 +01:00
Mehakmeet Singh bd077c3814
HADOOP-17953. S3A: Tests to lookup global or per-bucket configuration for encryption algorithm (#3525)
Followup to S3-CSE work of HADOOP-13887

Contributed by Mehakmeet Singh
2021-10-21 12:03:50 +01:00
adol001 21bd015df2
HADOOP-17932. Distcp file length comparison have no effect (#3519)
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
(cherry picked from commit 280ae1c0a9)
2021-10-18 19:09:09 +09:00
Viraj Jasani 77ee5a4266
HADOOP-17950. Provide replacement for deprecated APIs of commons-io IOUtils (#3515)
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
(cherry picked from commit 8071dbb9c6)
2021-10-07 11:00:19 +09:00
Steve Loughran 6f7b45641a
HADOOP-17922. move to fs.s3a.encryption.algorithm - JCEKS integration (#3466)
The ordering of the resolution of new and deprecated s3a encryption options
& secrets is the same when JCEKS and other hadoop credentials stores are used
to store them as when they are in XML files: per-bucket settings always take
priority over global values, even when the bucket-level options use the
old option names.

Contributed by Mehakmeet Singh and Steve Loughran

Change-Id: I871672071efa2eb6b600cb2658fceeef57f658a3
2021-10-05 11:39:43 +01:00
Mehakmeet Singh 769059c2f5
HADOOP-17871. S3A CSE: minor tuning (#3412)
This migrates the fs.s3a-server-side encryption configuration options
to a name which covers client-side encryption too.

fs.s3a.server-side-encryption-algorithm becomes fs.s3a.encryption.algorithm
fs.s3a.server-side-encryption.key becomes fs.s3a.encryption.key

The existing keys remain valid, simply deprecated and remapped
to the new values. If you want server-side encryption options
to be picked up regardless of hadoop versions, use
the old keys.

(the old key also works for CSE, though as no version of Hadoop
with CSE support has shipped without this remapping, it's less
relevant)

Contributed by: Mehakmeet Singh

Change-Id: I51804b21b287dbce18864f0a6ad17126aba2b281
2021-10-05 11:39:25 +01:00
Mehakmeet Singh abb367aec6
HADOOP-17817.HADOOP-17823. S3A to raise IOE if both S3-CSE and S3Guard enabled (#3239)
S3A S3Guard tests to skip if S3-CSE are enabled (#3263)

    Follow on to
    * HADOOP-13887. Encrypt S3A data client-side with AWS SDK (S3-CSE)

    If the S3A bucket is set up to use S3-CSE encryption, all tests which turn
    on S3Guard are skipped, so they don't raise any exceptions about
    incompatible configurations.

Contributed by Mehakmeet Singh

Change-Id: I9f4188109b56a1f4e5a31fae265d980c5795db1e
2021-10-05 11:38:57 +01:00
Mehakmeet Singh aee975a136
HADOOP-13887. Support S3 client side encryption (S3-CSE) using AWS-SDK (#2706)
This (big!) patch adds support for client side encryption in AWS S3,
with keys managed by AWS-KMS.

Read the documentation in encryption.md very, very carefully before
use and consider it unstable.

S3-CSE is enabled in the existing configuration option
"fs.s3a.server-side-encryption-algorithm":

fs.s3a.server-side-encryption-algorithm=CSE-KMS
fs.s3a.server-side-encryption.key=<KMS_KEY_ID>

You cannot enable CSE and SSE in the same client, although
you can still enable a default SSE option in the S3 console.

* Filesystem list/get status operations subtract 16 bytes from the length
  of all files >= 16 bytes long to compensate for the padding which CSE
  adds.
* The SDK always warns about the specific algorithm chosen being
  deprecated. It is critical to use this algorithm for ranged
  GET requests to work (i.e. random IO). Ignore.
* Unencrypted files CANNOT BE READ.
  The entire bucket SHOULD be encrypted with S3-CSE.
* Uploading files may be a bit slower as blocks are now
  written sequentially.
* The Multipart Upload API is disabled when S3-CSE is active.

Contributed by Mehakmeet Singh

Change-Id: Ie1a27a036a39db66a67e9c6d33bc78d54ea708a0
2021-10-05 11:37:41 +01:00
Josh Elser feeaebeb84
HADOOP-17934. ABFS: Make sure the AbfsHttpOperation is non-null before using it (#3477)
Contributed by: Josh Elser

Change-Id: I24a2e0322d8cae2d72d65c7f3d8a74580a418317
2021-10-04 20:54:39 +01:00
Mehakmeet Singh 8e5620cd9e
HADOOP-17195. ABFS: OutOfMemory error while uploading huge files (#3446)
Addresses the problem of processes running out of memory when
there are many ABFS output streams queuing data to upload,
especially when the network upload bandwidth is less than the rate
data is generated.

ABFS Output streams now buffer their blocks of data to
"disk", "bytebuffer" or "array", as set in
"fs.azure.data.blocks.buffer"

When buffering via disk, the location for temporary storage
is set in "fs.azure.buffer.dir"

For safe scaling: use "disk" (default); for performance, when
confident that upload bandwidth will never be a bottleneck,
experiment with the memory options.

The number of blocks a single stream can have queued for uploading
is set in "fs.azure.block.upload.active.blocks".
The default value is 20.

Contributed by Mehakmeet Singh.
2021-09-22 11:19:16 +01:00
sumangala-patki dd30db78e7
HADOOP-17290. ABFS: Add Identifiers to Client Request Header (#2520)
Contributed by Sumangala Patki.

(cherry picked from commit 35570e414a)
2021-09-21 16:45:51 +01:00
sumangala-patki 1cb9e747eb
HADOOP-17618. ABFS: Partially obfuscate SAS object IDs in Logs (#2845)
Contributed by Sumangala Patki

(cherry picked from commit 3450522c2f)
2021-09-09 14:04:12 +01:00
Steve Loughran a2242df10a
HADOOP-17894. CredentialProviderFactory.getProviders() recursion loading JCEKS file from S3A (#3393)
* CredentialProviderFactory to detect and report on recursion.
* S3AFS to remove incompatible providers.
* Integration Test for this.

Contributed by Steve Loughran.

Change-Id: Ia247b3c9fe8488ffdb7f57b40eb6e37c57e522ef
2021-09-08 17:00:20 +01:00
Mukund Thakur 3b1c594355 HADOOP-17156. ABFS: Release the byte buffers held by input streams in close() (#3285)
Contributed By: Mukund Thakur
2021-09-07 15:29:22 +05:30
Dongjoon Hyun 8606b2cddd
HADOOP-17869. `fs.s3a.connection.maximum` should be bigger than `fs.s3a.threads.max` (#3337).
The value of `fs.s3a.connection.maximum` has been increased to 96

Contributed by Dongjoon Hyun

Change-Id: I9020a2bfd2a67fa7a2ec0598ed9d63e78ee99c73
2021-08-30 18:31:57 +01:00