This addresses HADOOP-18521, "ABFS ReadBufferManager buffer sharing
across concurrent HTTP requests" by not trying to cancel
in progress reads.
It supercedes HADOOP-18528, which disables the prefetching.
If that patch is applied *after* this one, prefetching
will be disabled.
As well as changing the default value in the code,
core-default.xml is updated to set
fs.azure.enable.readahead = true
As a result, if Configuration.get("fs.azure.enable.readahead")
returns a non-null value, then it can be inferred that
it was set in or core-default.xml (the fix is present)
or in core-site.xml (someone asked for it).
Note: this commit contains the followup commit:
HADOOP-18546. Followup: ITestReadBufferManager fix (#5198)
That is needed to avoid race conditions in the test.
Contributed by Pranav Saxena.
This allows abfs request throttling to be shared across all
abfs connections talking to containers belonging to the same abfs storage
account -as that is the level at which IO throttling is applied.
The option is enabled/disabled in the configuration option
"fs.azure.account.throttling.enabled";
The default is "true"
Contributed by Anmol Asrani
This commit parses SAS Tokens and removes the unwanted prefix of '?' from them, if present.
At present, SAS Tokens are provided to the driver through customer implementations of the SASTokenProvider interface. The SAS token providers should not assume that the token will be the first query parameter in the URIs that communicate with the backend. However, it was observed that certain public interfaces provided by Storage to generate SAS can include the '?' as the first character of the SAS Token, which would ideally be the case when it is the first query parameter. Thus, tokens that contain this prefix will lead to an error in the driver due to a clash of query parameters.
To avoid failures for use of such SAS tokens, after receiving the SAS Token from the provider, the code checks for whether any ? prefix is present or not. If yes, it is removed before further usage of the token. This way, users would not have to manually remove the prefix before passing it on as a configuration.
Contributed by Sree Bhattacharya
* Exactly 1 sending thread per an RPC connection.
* If the calling thread is interrupted before the socket write, it will be skipped instead of sending it anyways.
* If the calling thread is interrupted during the socket write, the write will finish.
* RPC requests will be written to the socket in the order received.
* Sending thread is only started by the receiving thread.
* The sending thread periodically checks the shouldCloseConnection flag.
Disables block prefetching on ABFS InputStreams, by setting
fs.azure.enable.readahead to false in core-default.xml and
the matching java constant.
This prevents
HADOOP-18521. ABFS ReadBufferManager buffer sharing across concurrent HTTP requests.
Once a fix for that is committed, this change can be reverted.
Contributed by Mehakmeet Singh.
* HADOOP-18517. ABFS: Add fs.azure.enable.readahead option to disable readahead
Adds new config option to turn off readahead
* also allows it to be passed in through openFile(),
* extends ITestAbfsReadWriteAndSeek to use the option, including one
replicated test...that shows that turning it off is slower.
Important: this does not address the critical data corruption issue
HADOOP-18521. ABFS ReadBufferManager buffer sharing across concurrent HTTP requests
What is does do is provide a way to completely bypass the ReadBufferManager.
To mitigate the problem, either fs.azure.enable.readahead needs to be set to false,
or set "fs.azure.readaheadqueue.depth" to 0 -this still goes near the (broken)
ReadBufferManager code, but does't trigger the bug.
For safe reading of files through the ABFS connector, readahead MUST be disabled
or the followup fix to HADOOP-18521 applied
Contributed by Steve Loughran
This fixes a race condition with the TemporaryAWSCredentialProvider
one which has existed for a long time but which only surfaced
(usually in Spark) when the bucket existence probe was disabled
by setting fs.s3a.bucket.probe to 0 -a performance speedup
which was made the default in HADOOP-17454.
Contributed by Jimmy Wong.
The option "fs.s3a.proxy.ssl.enabled" controls
whether the s3a connects to a proxy over HTTP (default) or HTTPS.
Set to "true" to use HTTPS.
Contributed by Mehakmeet Singh
Move construction of XML parsers in YARN
modules to using the locked-down parser factory
of HADOOP-18469.
One exception: GpuDeviceInformationParser still supports DTD resolution;
all other features are disabled.
Contributed by P J Fanning
Moves from com.sun.jersey 1.19 to the artifact
com.github.pjfanning:jersey-json:1.20
This allows jackson 1 to be removed from the classpath.
Contains
* HADOOP-16908. Prune Jackson 1 from the codebase and restrict
its usage for future
* HADOOP-18219. Fix shaded client test failure
These are needed for the HADOOP-15983 changes to build.
Contributed by PJ Fanning.
This is to try and close the underlying filesystems when the FileContext APIs are used.
Without this, threads may be leaked
Contributed by Steve Loughran
Addresses CVE-2020-15522 and CVE-2020-26939.
This can break builds with older maven shade plugins or
other code using asm.jar which is not aware of recent java bytecodes
and/or multi-release JARs. fix: use a later version of asm.jar
Contributed by PJ Fanning
The AWS SDKV2 upgrade log no longer warns about instantiation
of the v1 SDK credential providers which are commonly used in
s3a configurations:
* com.amazonaws.auth.EnvironmentVariableCredentialsProvider
* com.amazonaws.auth.EC2ContainerCredentialsProviderWrapper
* com.amazonaws.auth.InstanceProfileCredentialsProvider
When the hadoop-aws module moves to the v2 SDK, references to these
credential providers will be rewritten to their v2 equivalents.
Follow-on to HADOOP-18382. "Upgrade AWS SDK to V2 - Prerequisites"
Contributed by Ahmar Suhail
This patch prepares the hadoop-aws module for a future
migration to using the v2 AWS SDK (HADOOP-18073)
That upgrade will be incompatible; this patch prepares
for it:
-marks some credential providers and other
classes and methods as @deprecated.
-updates site documentation
-reduces the visibility of the s3 client;
other than for testing, it is kept private to
the S3AFileSystem class.
-logs some warnings when deprecated APIs are used.
The warning messages are printed only once
per JVM's life. To disable them, set the
log level of org.apache.hadoop.fs.s3a.SDKV2Upgrade
to ERROR
Contributed by Ahmar Suhail
The swift:// connector for openstack support has been removed.
The hadoop-openstack jar remains, only now it is empty of code.
This is to ensure that projects which declare the JAR a dependency
will still have successful builds.
Contributed by Steve Loughran
Add to XMLUtils a set of methods to create secure XML Parsers/transformers,
locking down DTD, schema, XXE exposure.
Use these wherever XML parsers are created.
Contributed by PJ Fanning
part of HADOOP-18103.
Also introducing a config fs.s3a.vectored.active.ranged.reads
to configure the maximum number of number of range reads a
single input stream can have active (downloading, or queued)
to the central FileSystem instance's pool of queued operations.
This stops a single stream overloading the shared thread pool.
Contributed by: Mukund Thakur
Conflicts:
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java