Commit Graph

1539 Commits

Author SHA1 Message Date
GuoPhilipse 34c415ca2e
HADOOP-18026. Fix default value of Magic committer (#3723)
Contributed by guophilipse

Change-Id: If915623c76619dd3d3b3bdf989688fa13e56fec1
2021-12-03 16:53:44 +00:00
Steve Loughran 67eaf5aa9f
HADOOP-17979. Add Interface EtagSource to allow FileStatus subclasses to provide etags (#3633)
Contributed by Steve Loughran

Change-Id: I596205d788f623114c12962941445432e2036c34
2021-11-29 16:20:55 +00:00
Steve Loughran e1267608ec
HADOOP-18002. ABFS rename idempotency broken -remove recovery (#3641)
Cut modtime-based rename recovery as object modification time
is not updated during rename operation.
Applications will have to use etag API of HADOOP-17979
and implement it themselves.

Why not do the HEAD and etag recovery in ABFS client?
Cuts the IO capacity in half so kills job commit performance.
The manifest committer of MAPREDUCE-7341 will do this recovery
and act as the reference implementation of the algorithm.

Contributed by: Steve Loughran

Change-Id: I810054c9fd05041dac552f13d31fb15d7524721b
2021-11-17 11:53:34 +00:00
Chao Sun e079fa6577 Preparing for 3.3.3 development 2021-11-16 16:02:34 -08:00
Steve Loughran 7b632dd22b Revert "HADOOP-17873. ABFS: Fix transient failures in ITestAbfsStreamStatistics and ITestAbfsRestOperationException (#3341)"
This reverts commit 0379aebafe.
2021-11-05 14:22:07 +00:00
sumangala-patki 689dd7bf17
HADOOP-17863. ABFS: Fix compiler deprecation warning in TextFileBasedIdentityHandler (#3332)
Closes #3332

Contributed by Sumangala Patki

Change-Id: I2abd33bd62bb734a431cccfc50a52bdeb2bf7db6
2021-11-05 12:55:45 +00:00
Jinhu Wu 0557da6820 HADOOP-17374. support listObjectV2 (#3587)
(cherry picked from commit a9c51ea57d)
2021-11-04 21:45:04 -07:00
sumangala-patki 0379aebafe
HADOOP-17873. ABFS: Fix transient failures in ITestAbfsStreamStatistics and ITestAbfsRestOperationException (#3341)
Addresses transient failures in the following test classes:

* ITestAbfsStreamStatistics: Uses a filesystem level static instance to record read/write statistics, which also tracks these operations in other tests running in parallel. Marked for sequential-only run to avoid transient failure

* ITestAbfsRestOperationException: The use of a static member to track retry count causes transient failures when two tests of this class happen to run together. Switch to non-static variable for assertions on retry count

closes #3341

Contributed by Sumangala Patki

Change-Id: Ied4dec35c81e94efe5f999acae4bb8fde278202e
2021-11-04 15:57:42 +00:00
Steve Loughran a68671eaf7
HADOOP-17928. Syncable: S3A to warn and downgrade (#3585)
This switches the default behavior of S3A output streams
to warning that Syncable.hsync() or hflush() have been
called; it's not considered an error unless the defaults
are overridden.

This avoids breaking applications which call the APIs,
at the risk of people trying to use S3 as a safe store
of streamed data (HBase WALs, audit logs etc).

Contributed by Steve Loughran.

Change-Id: I0a02ec1e622343619f147f94158c18928a73a885
2021-11-04 14:41:42 +00:00
Anoop Sam John 913d06ad4d
HADOOP-17770 WASB : Support disabling buffered reads in positional reads (#3233) 2021-10-22 11:45:42 +05:30
Tamas Domok e7785bb7e5
HADOOP-17974. Import statements in hadoop-aws trigger clover failures. #3572
Contributed by Tamas Domok

Change-Id: I47da62596ce23d71709c65eb493bf656967d4415
2021-10-21 18:43:54 +01:00
Mehakmeet Singh bd077c3814
HADOOP-17953. S3A: Tests to lookup global or per-bucket configuration for encryption algorithm (#3525)
Followup to S3-CSE work of HADOOP-13887

Contributed by Mehakmeet Singh
2021-10-21 12:03:50 +01:00
adol001 21bd015df2
HADOOP-17932. Distcp file length comparison have no effect (#3519)
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
(cherry picked from commit 280ae1c0a9)
2021-10-18 19:09:09 +09:00
Viraj Jasani 77ee5a4266
HADOOP-17950. Provide replacement for deprecated APIs of commons-io IOUtils (#3515)
Signed-off-by: Akira Ajisaka <aajisaka@apache.org>
(cherry picked from commit 8071dbb9c6)
2021-10-07 11:00:19 +09:00
Steve Loughran 6f7b45641a
HADOOP-17922. move to fs.s3a.encryption.algorithm - JCEKS integration (#3466)
The ordering of the resolution of new and deprecated s3a encryption options
& secrets is the same when JCEKS and other hadoop credentials stores are used
to store them as when they are in XML files: per-bucket settings always take
priority over global values, even when the bucket-level options use the
old option names.

Contributed by Mehakmeet Singh and Steve Loughran

Change-Id: I871672071efa2eb6b600cb2658fceeef57f658a3
2021-10-05 11:39:43 +01:00
Mehakmeet Singh 769059c2f5
HADOOP-17871. S3A CSE: minor tuning (#3412)
This migrates the fs.s3a-server-side encryption configuration options
to a name which covers client-side encryption too.

fs.s3a.server-side-encryption-algorithm becomes fs.s3a.encryption.algorithm
fs.s3a.server-side-encryption.key becomes fs.s3a.encryption.key

The existing keys remain valid, simply deprecated and remapped
to the new values. If you want server-side encryption options
to be picked up regardless of hadoop versions, use
the old keys.

(the old key also works for CSE, though as no version of Hadoop
with CSE support has shipped without this remapping, it's less
relevant)

Contributed by: Mehakmeet Singh

Change-Id: I51804b21b287dbce18864f0a6ad17126aba2b281
2021-10-05 11:39:25 +01:00
Mehakmeet Singh abb367aec6
HADOOP-17817.HADOOP-17823. S3A to raise IOE if both S3-CSE and S3Guard enabled (#3239)
S3A S3Guard tests to skip if S3-CSE are enabled (#3263)

    Follow on to
    * HADOOP-13887. Encrypt S3A data client-side with AWS SDK (S3-CSE)

    If the S3A bucket is set up to use S3-CSE encryption, all tests which turn
    on S3Guard are skipped, so they don't raise any exceptions about
    incompatible configurations.

Contributed by Mehakmeet Singh

Change-Id: I9f4188109b56a1f4e5a31fae265d980c5795db1e
2021-10-05 11:38:57 +01:00
Mehakmeet Singh aee975a136
HADOOP-13887. Support S3 client side encryption (S3-CSE) using AWS-SDK (#2706)
This (big!) patch adds support for client side encryption in AWS S3,
with keys managed by AWS-KMS.

Read the documentation in encryption.md very, very carefully before
use and consider it unstable.

S3-CSE is enabled in the existing configuration option
"fs.s3a.server-side-encryption-algorithm":

fs.s3a.server-side-encryption-algorithm=CSE-KMS
fs.s3a.server-side-encryption.key=<KMS_KEY_ID>

You cannot enable CSE and SSE in the same client, although
you can still enable a default SSE option in the S3 console.

* Filesystem list/get status operations subtract 16 bytes from the length
  of all files >= 16 bytes long to compensate for the padding which CSE
  adds.
* The SDK always warns about the specific algorithm chosen being
  deprecated. It is critical to use this algorithm for ranged
  GET requests to work (i.e. random IO). Ignore.
* Unencrypted files CANNOT BE READ.
  The entire bucket SHOULD be encrypted with S3-CSE.
* Uploading files may be a bit slower as blocks are now
  written sequentially.
* The Multipart Upload API is disabled when S3-CSE is active.

Contributed by Mehakmeet Singh

Change-Id: Ie1a27a036a39db66a67e9c6d33bc78d54ea708a0
2021-10-05 11:37:41 +01:00
Josh Elser feeaebeb84
HADOOP-17934. ABFS: Make sure the AbfsHttpOperation is non-null before using it (#3477)
Contributed by: Josh Elser

Change-Id: I24a2e0322d8cae2d72d65c7f3d8a74580a418317
2021-10-04 20:54:39 +01:00
Mehakmeet Singh 8e5620cd9e
HADOOP-17195. ABFS: OutOfMemory error while uploading huge files (#3446)
Addresses the problem of processes running out of memory when
there are many ABFS output streams queuing data to upload,
especially when the network upload bandwidth is less than the rate
data is generated.

ABFS Output streams now buffer their blocks of data to
"disk", "bytebuffer" or "array", as set in
"fs.azure.data.blocks.buffer"

When buffering via disk, the location for temporary storage
is set in "fs.azure.buffer.dir"

For safe scaling: use "disk" (default); for performance, when
confident that upload bandwidth will never be a bottleneck,
experiment with the memory options.

The number of blocks a single stream can have queued for uploading
is set in "fs.azure.block.upload.active.blocks".
The default value is 20.

Contributed by Mehakmeet Singh.
2021-09-22 11:19:16 +01:00
sumangala-patki dd30db78e7
HADOOP-17290. ABFS: Add Identifiers to Client Request Header (#2520)
Contributed by Sumangala Patki.

(cherry picked from commit 35570e414a)
2021-09-21 16:45:51 +01:00
sumangala-patki 1cb9e747eb
HADOOP-17618. ABFS: Partially obfuscate SAS object IDs in Logs (#2845)
Contributed by Sumangala Patki

(cherry picked from commit 3450522c2f)
2021-09-09 14:04:12 +01:00
Steve Loughran a2242df10a
HADOOP-17894. CredentialProviderFactory.getProviders() recursion loading JCEKS file from S3A (#3393)
* CredentialProviderFactory to detect and report on recursion.
* S3AFS to remove incompatible providers.
* Integration Test for this.

Contributed by Steve Loughran.

Change-Id: Ia247b3c9fe8488ffdb7f57b40eb6e37c57e522ef
2021-09-08 17:00:20 +01:00
Mukund Thakur 3b1c594355 HADOOP-17156. ABFS: Release the byte buffers held by input streams in close() (#3285)
Contributed By: Mukund Thakur
2021-09-07 15:29:22 +05:30
Dongjoon Hyun 8606b2cddd
HADOOP-17869. `fs.s3a.connection.maximum` should be bigger than `fs.s3a.threads.max` (#3337).
The value of `fs.s3a.connection.maximum` has been increased to 96

Contributed by Dongjoon Hyun

Change-Id: I9020a2bfd2a67fa7a2ec0598ed9d63e78ee99c73
2021-08-30 18:31:57 +01:00
Steve Loughran c1ad91e72d
HADOOP-17822. fs.s3a.acl.default not working after S3A Audit feature (#3249)
Fixes the regression caused by HADOOP-17511 by moving where the
option  fs.s3a.acl.default is read -doing it before the RequestFactory
is created.

Adds

* A unit test in TestRequestFactory to verify the ACLs are set
  on all file write operations.
* A new ITestS3ACannedACLs test which verifies that ACLs really
  do get all the way through.
* S3A Assumed Role delegation tokens to include the IAM permission
  s3:PutObjectAcl in the generated role.

Contributed by Steve Loughran

Change-Id: I3abac6a1b9e150b6b6df0af7c2c70093f8f518cb
2021-08-02 15:33:34 +01:00
Steve Loughran 26514b6534 HADOOP-17628. Distcp contract test is really slow with ABFS and S3A; timing out. (#3240)
This patch cuts down the size of directory trees used for
distcp contract tests against object stores, so making
them much faster against distant/slow stores.

On abfs, the test only runs with -Dscale (as was the case for s3a already),
and has the larger scale test timeout.

After every test case, the FileSystem IOStatistics are logged,
to provide information about what IO is taking place and
what it's performance is.

There are some test cases which upload files of 1+ MiB; you can
increase the size of the upload in the option
"scale.test.distcp.file.size.kb" 
Set it to zero and the large file tests are skipped.

Contributed by Steve Loughran.
2021-08-02 12:58:37 +01:00
Bobby Wang 904cdd0b00
HADOOP-17812. NPE in S3AInputStream read() after failure to reconnect to store (#3222)
This improves error handling after multiple failures reading data
-when the read fails and attempts to reconnect() also fail.

Contributed by Bobby Wang.

Change-Id: If17dee395ad6b9b7c738021bad20d0a13eb4011e
2021-08-02 12:58:25 +01:00
Petre Bogdan Stolojan f2cec5cb88
HADOOP-17139 Re-enable optimized copyFromLocal implementation in S3AFileSystem (#3101)
This work
* Defines the behavior of FileSystem.copyFromLocal in filesystem.md
* Implements a high performance implementation of copyFromLocalOperation
  for S3
* Adds a contract test for the operation: AbstractContractCopyFromLocalTest
* Implements the contract tests for Local and S3A FileSystems

Contributed by: Bogdan Stolojan

Change-Id: I25d502102775c3626c4264e5a14c649879730050
2021-08-02 11:58:36 +01:00
Brian Loss 37e0828e76
HADOOP-17811: ABFS ExponentialRetryPolicy doesn't pick up configuration values (#3221)
Contributed by Brian Loss.

Change-Id: I5f24196d1d02de91336c3679abaf8d55cfaed746
2021-08-02 11:37:33 +01:00
bshashikant 18bd66e5b0 HDFS-16145. CopyListing fails with FNF exception with snapshot diff. (#3234)
(cherry picked from commit dac10fcc20)
2021-07-28 09:38:06 +01:00
Petre Bogdan Stolojan e89d30b6b7
HADOOP-17458. S3A to treat "SdkClientException: Data read has a different length than the expected" as EOFException (#3040)
Some network exceptions can raise SdkClientException with message
`Data read has a different length than the expected`.

These should be recoverable.

Contributed by Bogdan Stolojan

Change-Id: Ia22fd77d90971e9e02b4f947398a4749eebe5909
2021-07-23 14:46:59 +01:00
Mehakmeet Singh 14a3e74c5c
HADOOP-17801. No error message reported when bucket doesn't exist in S3AFS (#3202)
Contributed by: Mehakmeet Singh.

Change-Id: I26c2a85ef6bbfd1b8269a23fc44d9a55d7fa091c
2021-07-16 15:36:54 +01:00
Mehakmeet Singh cd15b0cb8a HADOOP-17803. Remove WARN logging from LoggingAuditor when executing a request outside an audit span (#3207)
Followup to HADOOP-17511. "Add audit/telemetry logging to S3A connector"

Contributed by Mehakmeet Singh
2021-07-16 11:52:37 +01:00
snehavarma 11825d30e8
HADOOP-17714 ABFS: testBlobBackCompatibility, testRandomRead & WasbAbfsCompatibility tests fail when triggered with default configs (#3035) (#3126)
(cherry picked from commit 35e4c31fff)
2021-07-12 11:53:46 +05:30
snehavarma ab3809cf8d
HADOOP-17715 ABFS: Append blob tests with non HNS accounts fail (#3028) (#3125)
(cherry picked from commit 4c039fafeb)
2021-07-12 11:51:41 +05:30
sumangala-patki aa6a9cac72
HADOOP-17596. ABFS: Change default Readahead Queue Depth from num(processors) to const (#3106)
* HADOOP-17596. ABFS: Change default Readahead Queue Depth from num(processors) to const (#2795)
. Contributed by Sumangala Patki.

(cherry picked from commit 76d92eb2a2)
2021-07-10 15:09:59 +05:30
litao 7cb91db575 HDFS-16122. Fix DistCpContext#toString() (#3191). Contributed by tomscut.
Signed-off-by: Ayush Saxena <ayushsaxena@apache.org>
2021-07-10 13:56:36 +05:30
Mukund Thakur e8f9af6f2a
HADOOP-17250 Lot of short reads can be merged with readahead. (#3110)
Introducing fs.azure.readahead.range parameter which can be set by the user.
Data will be populated in buffer for random reads as well which leads to fewer
remote calls.

This patch also changes the seek implementation to perform a lazy seek. The
actual seek is done when a read is initiated and data is not present in the buffer else
data is returned from the buffer thus reducing the number of remote storage calls.

Contributed By: Mukund Thakur

Change-Id: Ib920eedd0087caa150afa4d4c23e89df56b29e83
2021-07-05 11:23:32 +01:00
Mehakmeet Singh f1a14df9e6
HADOOP-17774. S3A bytesRead FS statistic showing twice the correct value (#3144)
Contributed by: Mehakmeet Singh

Change-Id: I3302654ca36474a5f399aa848f88bce4587022d8
2021-07-02 14:13:26 +01:00
Zamil Majdy 80859d714d
HADOOP-17764. S3AInputStream read does not re-open the input stream on the second read retry attempt (#3109)
Contributed by Zamil Majdy.

Change-Id: I680d9c425c920ff1a7cd4764d62e10e6ac78bee4
2021-06-25 20:47:11 +01:00
Steve Loughran 39e6f2d191
HADOOP-17771. S3AFS creation fails "Unable to find a region via the region provider chain." (#3133)
This addresses the regression in Hadoop 3.3.1 where if no S3 endpoint
is set in fs.s3a.endpoint, S3A filesystem creation may fail on
non-EC2 deployments, depending on the local host environment setup.

* If fs.s3a.endpoint is empty/null, and fs.s3a.endpoint.region
  is null, the region is set to "us-east-1".
* If fs.s3a.endpoint.region is explicitly set to "" then the client
  falls back to the SDK region resolution chain; this works on EC2
* Details in troubleshooting.md, including a workaround for Hadoop-3.3.1+
* Also contains some minor restructuring of troubleshooting.md

Contributed by Steve Loughran.

Change-Id: Ife482cff513307cd52d59eec56beac0a33e031f5
2021-06-24 16:38:55 +01:00
Takanobu Asanuma 25138c98bf HADOOP-17760. Delete hadoop.ssl.enabled and dfs.https.enable from docs and core-default.xml (#3099)
Reviewed-by: Ayush Saxena <ayushsaxena@apache.org>
(cherry picked from commit 9e7c7ad129)
2021-06-17 10:00:36 +09:00
Petre Bogdan Stolojan 254d943126
HADOOP-17547 Magic committer to downgrade abort in cleanup if list uploads fails with access denied (#3051)
Contributed by Bogdan Stolojan

Change-Id: I32d6dc4f72087783a3ea12473d11690ac14fe3cb
2021-06-12 17:46:11 +01:00
Viraj Jasani 8f0ba9ee1b
HADOOP-17725. Improve error message for token providers in ABFS (#3041)
Contributed by Viraj Jasani.
2021-06-08 22:05:01 +01:00
Akira Ajisaka 37516726d7 HDFS-16050. Some dynamometer tests fail. (#3079)
Signed-off-by: Takanobu Asanuma <tasanuma@apache.org>
(cherry picked from commit 57a3613e5d)
2021-06-07 15:03:06 +09:00
zhengchenyu 7feb41b73d
MAPREDUCE-7287. Distcp will delete exists file , If we use "-delete and -update" options and distcp file. (#2852)
Contributed by zhengchenyu

Change-Id: I61edf9a443c0c6cd5b5dd911901708530cf131ed
2021-05-28 20:27:00 +01:00
Steve Loughran 464bbd5b7c
HADOOP-17511. Add audit/telemetry logging to S3A connector (#2807)
The S3A connector supports
"an auditor", a plugin which is invoked
at the start of every filesystem API call,
and whose issued "audit span" provides a context
for all REST operations against the S3 object store.

The standard auditor sets the HTTP Referrer header
on the requests with information about the API call,
such as process ID, operation name, path,
and even job ID.

If the S3 bucket is configured to log requests, this
information will be preserved there and so can be used
to analyze and troubleshoot storage IO.

Contributed by Steve Loughran.

Change-Id: Ic0a105c194342ed2d529833ecc42608e8ba2f258
2021-05-25 12:55:38 +01:00
Mehakmeet Singh b82a0fa9e6
HADOOP-17705. S3A to add Config to set AWS region (#3020)
The option `fs.s3a.endpoint.region` can be used
to explicitly set the AWS region of a bucket.

This is needed when using AWS Private Link, as
the region cannot be automatically determined.

Contributed by Mehakmeet Singh

Change-Id: I4b52f85d7af0ddd56b2b0505ac0124d4fcc67ca0
2021-05-24 13:13:37 +01:00
Mehakmeet Singh a786847b8f
HADOOP-17670. S3AFS and ABFS to log IOStats at DEBUG mode or optionally at INFO level in close() (#2963)
When the S3A and ABFS filesystems are closed,
their IOStatistics are logged at debug in the log:

org.apache.hadoop.fs.statistics.IOStatisticsLogging

Set `fs.iostatistics.logging.level` to `info` for the statistics
to be logged at info. (also: `warn` or `error` for even higher
log levels).

Contributed by: Mehakmeet Singh

Change-Id: I56d44ad89fc1c0dd4baf701681834e7fd96c544f
2021-05-24 13:04:20 +01:00