HADOOP-14099 Split S3 testing documentation out into its own file. Contributed by Steve Loughran.

This commit is contained in:
Steve Loughran 2017-02-22 11:41:24 +00:00
parent 4428c3da69
commit 171b18693a
2 changed files with 889 additions and 571 deletions


@@ -34,7 +34,12 @@ data between hadoop and other applications via the S3 object store.
replacement for `s3n:`, this filesystem binding supports larger files and promises
higher performance.
The specifics of using these filesystems are documented below.
The specifics of using these filesystems are documented in this section.
See also:
* [Testing](testing.html)
* [Troubleshooting S3a](troubleshooting_s3a.html)
### Warning #1: Object Stores are not filesystems
@@ -1685,30 +1690,30 @@ $ bin/hadoop fs -ls s3a://frankfurt/
WARN s3a.S3AFileSystem: Client: Amazon S3 error 400: 400 Bad Request; Bad Request (retryable)
com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: 923C5D9E75E44C06), S3 Extended Request ID: HDwje6k+ANEeDsM6aJ8+D5gUmNAMguOk2BvZ8PH3g9z0gpH+IuwT7N19oQOnIr5CIx7Vqb/uThE=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:1182)
at com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:489)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:310)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3785)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
at org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:307)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:284)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2793)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:101)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2830)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2812)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:325)
at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:235)
at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:218)
at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:103)
at org.apache.hadoop.fs.shell.Command.run(Command.java:165)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:315)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:373)
ls: doesBucketExist on frankfurt-new: com.amazonaws.services.s3.model.AmazonS3Exception:
Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request;
```
@@ -2026,549 +2031,3 @@ the DNS TTL of a JVM is "infinity".
To work with AWS better, set the DNS time-to-live of an application which
works with S3 to something lower. See [AWS documentation](http://docs.aws.amazon.com/AWSSdkDocsJava/latest/DeveloperGuide/java-dg-jvm-ttl.html).
## Testing the S3 filesystem clients
This module includes both unit tests, which can run in isolation without
connecting to the S3 service, and integration tests, which require a working
connection to S3 to interact with a bucket. Unit test suites follow the naming
convention `Test*.java`. Integration tests follow the naming convention
`ITest*.java`.
Due to eventual consistency, integration tests may fail without reason.
Transient failures, which no longer occur upon rerunning the test, should thus
be ignored.
To integration test the S3* filesystem clients, you need to provide two files
which pass in authentication details to the test runner.
1. `auth-keys.xml`
1. `core-site.xml`
These are both Hadoop XML configuration files, which must be placed into
`hadoop-tools/hadoop-aws/src/test/resources`.
### `core-site.xml`
This file pre-exists and sources the configurations created
under `auth-keys.xml`.
For most purposes you will not need to edit this file unless you
need to apply a specific, non-default property change during the tests.
### `auth-keys.xml`
The presence of this file triggers the testing of the S3 classes.
Without this file, *none of the integration tests in this module will be
executed*.
The XML file must contain all the ID/key information needed to connect
each of the filesystem clients to the object stores, and a URL for
each filesystem for its testing.
1. `test.fs.s3n.name` : the URL of the bucket for S3n tests
1. `test.fs.s3a.name` : the URL of the bucket for S3a tests
2. `test.fs.s3.name` : the URL of the bucket for "S3" tests
The contents of each bucket will be destroyed during the test process:
do not use the bucket for any purpose other than testing. Furthermore, for
s3a, all in-progress multi-part uploads to the bucket will be aborted at the
start of a test (by forcing `fs.s3a.multipart.purge=true`) to clean up the
temporary state of previously failed tests.
Example:
```xml
<configuration>
<property>
<name>test.fs.s3n.name</name>
<value>s3n://test-aws-s3n/</value>
</property>
<property>
<name>test.fs.s3a.name</name>
<value>s3a://test-aws-s3a/</value>
</property>
<property>
<name>test.fs.s3.name</name>
<value>s3://test-aws-s3/</value>
</property>
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>DONOTPCOMMITTHISKEYTOSCM</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>DONOTEVERSHARETHISSECRETKEY!</value>
</property>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>DONOTPCOMMITTHISKEYTOSCM</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>DONOTEVERSHARETHISSECRETKEY!</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<description>AWS access key ID. Omit for IAM role-based authentication.</description>
<value>DONOTCOMMITTHISKEYTOSCM</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<description>AWS secret key. Omit for IAM role-based authentication.</description>
<value>DONOTEVERSHARETHISSECRETKEY!</value>
</property>
<property>
<name>test.sts.endpoint</name>
<description>Specific endpoint to use for STS requests.</description>
<value>sts.amazonaws.com</value>
</property>
</configuration>
```
### File `contract-test-options.xml`
The file `hadoop-tools/hadoop-aws/src/test/resources/contract-test-options.xml`
must be created and configured for the test filesystems.
If a specific `fs.contract.test.fs.*` test path is not defined for
any of the filesystems, those tests will be skipped.
The standard S3 authentication details must also be provided. This can be
through copy-and-paste of the `auth-keys.xml` credentials, or it can be
through direct XInclude inclusion.
### s3://
The filesystem name must be defined in the property `fs.contract.test.fs.s3`.
Example:
```xml
<property>
<name>fs.contract.test.fs.s3</name>
<value>s3://test-aws-s3/</value>
</property>
```
### s3n://
In the file `src/test/resources/contract-test-options.xml`, the filesystem
name must be defined in the property `fs.contract.test.fs.s3n`.
The standard configuration options to define the S3N authentication details
must also be provided.
Example:
```xml
<property>
<name>fs.contract.test.fs.s3n</name>
<value>s3n://test-aws-s3n/</value>
</property>
```
### s3a://
In the file `src/test/resources/contract-test-options.xml`, the filesystem
name must be defined in the property `fs.contract.test.fs.s3a`.
The standard configuration options to define the S3A authentication details
must also be provided.
Example:
```xml
<property>
<name>fs.contract.test.fs.s3a</name>
<value>s3a://test-aws-s3a/</value>
</property>
```
### Complete example of `contract-test-options.xml`
```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing, software
~ distributed under the License is distributed on an "AS IS" BASIS,
~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~ See the License for the specific language governing permissions and
~ limitations under the License.
-->
<configuration>
<include xmlns="http://www.w3.org/2001/XInclude"
href="/home/testuser/.ssh/auth-keys.xml"/>
<property>
<name>fs.contract.test.fs.s3</name>
<value>s3://test-aws-s3/</value>
</property>
<property>
<name>fs.contract.test.fs.s3a</name>
<value>s3a://test-aws-s3a/</value>
</property>
<property>
<name>fs.contract.test.fs.s3n</name>
<value>s3n://test-aws-s3n/</value>
</property>
</configuration>
```
This example pulls in the `~/.ssh/auth-keys.xml` file for the credentials.
This provides one single place to keep the keys up to date —and means
that the file `contract-test-options.xml` does not contain any
secret credentials itself. As the auth keys XML file is kept out of the
source code tree, it is not going to get accidentally committed.
### Configuring S3a Encryption
For S3a encryption tests to run correctly, the
`fs.s3a.server-side-encryption-key` must be configured in the s3a contract xml
file with an AWS KMS encryption key ARN, as this value is different for each
AWS KMS key.
Example:
```xml
<property>
<name>fs.s3a.server-side-encryption-key</name>
<value>arn:aws:kms:us-west-2:360379543683:key/071a86ff-8881-4ba0-9230-95af6d01ca01</value>
</property>
```
You can also force all the tests to run with a specific SSE encryption method
by configuring the property `fs.s3a.server-side-encryption-algorithm` in the s3a
contract file.
### Running the Tests
After completing the configuration, execute the test run through Maven.
```bash
mvn clean verify
```
It's also possible to execute multiple test suites in parallel by passing the
`parallel-tests` property on the command line. The tests spend most of their
time blocked on network I/O with the S3 service, so running in parallel tends to
complete full test runs faster.
```bash
mvn -Dparallel-tests clean verify
```
Some tests must run with exclusive access to the S3 bucket, so even with the
`parallel-tests` property, several test suites will run in serial in a separate
Maven execution step after the parallel tests.
By default, `parallel-tests` runs 4 test suites concurrently. This can be tuned
by passing the `testsThreadCount` property.
```bash
mvn -Dparallel-tests -DtestsThreadCount=8 clean verify
```
To run just unit tests, which do not require S3 connectivity or AWS credentials,
use any of the above invocations, but switch the goal to `test` instead of
`verify`.
```bash
mvn clean test
mvn -Dparallel-tests clean test
mvn -Dparallel-tests -DtestsThreadCount=8 clean test
```
To run only a specific named subset of tests, pass the `test` property for unit
tests or the `it.test` property for integration tests.
```bash
mvn clean test -Dtest=TestS3AInputPolicies
mvn clean verify -Dit.test=ITestS3AFileContextStatistics -Dtest=none
mvn clean verify -Dtest=TestS3A* -Dit.test=ITestS3A*
```
Note that when running a specific subset of tests, the patterns passed in `test`
and `it.test` override the configuration of which tests need to run in isolation
in a separate serial phase (mentioned above). This can cause unpredictable
results, so the recommendation is to avoid passing `parallel-tests` in
combination with `test` or `it.test`. If you know that you are specifying only
tests that can run safely in parallel, then it will work. For wide patterns,
like `ITestS3A*` shown above, it may cause unpredictable test failures.
### Testing against different regions
S3A can connect to different regions —the tests support this. Simply
define the target region in `contract-test-options.xml` or any `auth-keys.xml`
file referenced.
```xml
<property>
<name>fs.s3a.endpoint</name>
<value>s3.eu-central-1.amazonaws.com</value>
</property>
```
This is used for all tests, except for scale tests using a public CSV .gz file
(see below).
### S3A session tests
The test `TestS3ATemporaryCredentials` requests a set of temporary
credentials from the STS service, then uses them to authenticate with S3.
If an S3 implementation does not support STS, then the functional test
cases must be disabled:
```xml
<property>
<name>test.fs.s3a.sts.enabled</name>
<value>false</value>
</property>
```
These tests request a temporary set of credentials from the STS service endpoint.
An alternate endpoint may be defined in `test.fs.s3a.sts.endpoint`.
```xml
<property>
<name>test.fs.s3a.sts.endpoint</name>
<value>https://sts.example.org/</value>
</property>
```
The default is "", meaning "use the Amazon default value".
### CSV Data source Tests
The `TestS3AInputStreamPerformance` tests require read access to a multi-MB
text file. The default file for these tests is one published by Amazon,
[s3a://landsat-pds.s3.amazonaws.com/scene_list.gz](http://landsat-pds.s3.amazonaws.com/scene_list.gz).
This is a gzipped CSV index of other files which Amazon serves for open use.
The path to this object is set in the option `fs.s3a.scale.test.csvfile`,
```xml
<property>
<name>fs.s3a.scale.test.csvfile</name>
<value>s3a://landsat-pds/scene_list.gz</value>
</property>
```
1. If the option is not overridden, the default value is used. This
is hosted in Amazon's US-east datacenter.
1. If `fs.s3a.scale.test.csvfile` is empty, tests which require it will be skipped.
1. If the data cannot be read for any reason then the test will fail.
1. If the property is set to a different path, then that data must be readable
and "sufficiently" large.
(the reason the space or newline is needed is to add "an empty entry"; an empty
`<value/>` would be considered undefined and pick up the default)
If using a test file in an S3 region requiring a different endpoint value
set in `fs.s3a.endpoint`, a bucket-specific endpoint must be defined.
For the default test dataset, hosted in the `landsat-pds` bucket, this is:
```xml
<property>
<name>fs.s3a.bucket.landsat-pds.endpoint</name>
<value>s3.amazonaws.com</value>
<description>The endpoint for s3a://landsat-pds URLs</description>
</property>
```
To test on alternate infrastructures supporting
the same APIs, the option `fs.s3a.scale.test.csvfile` must either be
set to " " (a single space), or an object of at least 10MB must be uploaded
to the object store and the `fs.s3a.scale.test.csvfile` option set to its path.
```xml
<property>
<name>fs.s3a.scale.test.csvfile</name>
<value> </value>
</property>
```
### Viewing Integration Test Reports
Integration test results and logs are stored in `target/failsafe-reports/`.
An HTML report can be generated during site generation, or with the `surefire-report`
plugin:
```
mvn surefire-report:failsafe-report-only
```
### Scale Tests
There is a set of tests designed to measure the scalability and performance
of the S3A client at scale, the *Scale Tests*. Tests include: creating
and traversing directory trees, uploading large files, renaming them,
deleting them, seeking through the files, performing random IO, and others.
This makes them a foundational part of the benchmarking.
By their very nature they are slow. And, as their execution time is often
limited by bandwidth between the computer running the tests and the S3 endpoint,
parallel execution does not speed these tests up.
#### Enabling the Scale Tests
The tests are enabled if the `scale` property is set in the maven build;
this can be done regardless of whether or not the parallel test profile
is used.
```bash
mvn verify -Dscale
mvn verify -Dparallel-tests -Dscale -DtestsThreadCount=8
```
The most bandwidth intensive tests (those which upload data) always run
sequentially; those which are slow due to HTTPS setup costs or server-side
actions are included in the set of parallelized tests.
#### Maven build tuning options
Some of the tests can be tuned from the maven build or from the
configuration file used to run the tests.
```bash
mvn verify -Dscale -Dfs.s3a.scale.test.huge.filesize=128M
```
The algorithm is
1. The value is queried from the configuration file, using a default value if
it is not set.
1. The value is queried from the JVM System Properties, where it is passed
down by maven.
1. If the system property is null, empty, or it has the value `unset`, then
the configuration value is used. The `unset` option is used to
[work round a quirk in maven property propagation](http://stackoverflow.com/questions/7773134/null-versus-empty-arguments-in-maven).
Only a few properties can be set this way; more will be added.
| Property | Meaning |
|-----------|-------------|
| `fs.s3a.scale.test.timeout`| Timeout in seconds for scale tests |
| `fs.s3a.scale.test.huge.filesize`| Size for huge file uploads |
| `fs.s3a.scale.test.huge.huge.partitionsize`| Size for partitions in huge file uploads |
The file and partition sizes are numeric values with a k/m/g/t/p suffix depending
on the desired size. For example: 128M, 128m, 2G, 4T, or even 1P.
#### Scale test configuration options
Some scale tests perform multiple operations (such as creating many directories).
The exact number of operations to perform is configurable in the option
`scale.test.operation.count`
```xml
<property>
<name>scale.test.operation.count</name>
<value>10</value>
</property>
```
Larger values generate more load, and are recommended when testing locally,
or in batch runs.
Smaller values result in faster test runs, especially when the object
store is a long way away.
Operations which work on directories have a separate option: this controls
the width and depth of tests creating recursive directories. Larger
values create exponentially more directories, with consequent performance
impact.
```xml
<property>
<name>scale.test.directory.count</name>
<value>2</value>
</property>
```
DistCp tests targeting S3A support a configurable file size. The default is
10 MB, but the configuration value is expressed in KB so that it can be tuned
smaller to achieve faster test runs.
```xml
<property>
<name>scale.test.distcp.file.size.kb</name>
<value>10240</value>
</property>
```
S3A specific scale test properties are
##### `fs.s3a.scale.test.huge.filesize`: size in MB for "Huge file tests".
The Huge File tests validate S3A's ability to handle large files —the property
`fs.s3a.scale.test.huge.filesize` declares the file size to use.
```xml
<property>
<name>fs.s3a.scale.test.huge.filesize</name>
<value>200M</value>
</property>
```
Amazon S3 handles files larger than 5GB differently than smaller ones.
Setting the huge filesize to a number greater than that validates support
for huge files.
```xml
<property>
<name>fs.s3a.scale.test.huge.filesize</name>
<value>6G</value>
</property>
```
Tests at this scale are slow: they are best executed from hosts running in
the cloud infrastructure where the S3 endpoint is based.
Otherwise, set a large timeout in `fs.s3a.scale.test.timeout`
```xml
<property>
<name>fs.s3a.scale.test.timeout</name>
<value>432000</value>
</property>
```
The tests are executed in an order such that created files are only cleaned up
after all the tests have finished. If the tests are interrupted, the test data will remain.
### Testing against non-AWS S3 endpoints
The S3A filesystem is designed to work with storage endpoints which implement
the S3 protocols to the extent that the Amazon S3 SDK is capable of talking
to it. We encourage testing against other filesystems and submissions of patches
which address issues. In particular, we encourage testing of Hadoop release
candidates, as these third-party endpoints get even less testing than the
S3 endpoint itself.
**Disabling the encryption tests**
If the endpoint doesn't support server-side-encryption, these will fail.
```xml
<property>
<name>test.fs.s3a.encryption.enabled</name>
<value>false</value>
</property>
```
Encryption is only used for those specific test suites with `Encryption` in
their classname.


@@ -0,0 +1,859 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
# Testing the Hadoop S3 filesystem clients
<!-- MACRO{toc|fromDepth=0|toDepth=5} -->
This module includes both unit tests, which can run in isolation without
connecting to the S3 service, and integration tests, which require a working
connection to S3 to interact with a bucket. Unit test suites follow the naming
convention `Test*.java`. Integration tests follow the naming convention
`ITest*.java`.
Due to eventual consistency, integration tests may fail without reason.
Transient failures, which no longer occur upon rerunning the test, should thus
be ignored.
## Policy for submitting patches which affect the `hadoop-aws` module.
The Apache Jenkins infrastructure does not run any S3 integration tests,
due to the need to keep credentials secure.
### The submitter of any patch is required to run all the integration tests and declare which S3 region/implementation they used.
This is important: **patches which do not include this declaration will be ignored**
This policy has proven to be the only mechanism to guarantee full regression
testing of code changes. Why the declaration of region? Two reasons
1. It helps us identify regressions which only surface against specific endpoints
or third-party implementations of the S3 protocol.
1. It forces the submitters to be more honest about their testing. It's easy
to lie, "yes, I tested this". To say "yes, I tested this against S3 US-west"
is a more specific lie and harder to make. And, if you get caught out: you
lose all credibility with the project.
You don't need to test from a VM within the AWS infrastructure; with the
`-Dparallel-tests` option the non-scale tests complete in under ten minutes.
Because the tests clean up after themselves, they are also designed to be low
cost. It's neither hard nor expensive to run the tests; if you can't,
there's no guarantee your patch works. The reviewers have enough to do, and
don't have the time to do these tests, especially as every failure will simply
make for slow iterative development.
Please: run the tests. And if you don't, we are sorry for declining your
patch, but we have to.
### What if there's an intermittent failure of a test?
Some of the tests do fail intermittently, especially in parallel runs.
If this happens, try to run the test on its own to see if the test succeeds.
If it still fails, include this fact in your declaration. We know some tests
are intermittently unreliable.
### What if the tests are timing out or failing over my network connection?
The tests and the S3A client are designed to be configurable for different
timeouts. If you are seeing problems and this configuration isn't working,
that's a sign that the configuration mechanism isn't complete. If it's happening
in the production code, that could be a sign of a problem which may surface
over long-haul connections. Please help us identify and fix these problems
&mdash; especially as you are the one best placed to verify the fixes work.
## Setting up the tests
To integration test the S3* filesystem clients, you need to provide two files
which pass in authentication details to the test runner.
1. `auth-keys.xml`
1. `contract-test-options.xml`
These are both Hadoop XML configuration files, which must be placed into
`hadoop-tools/hadoop-aws/src/test/resources`.
### File `core-site.xml`
This file pre-exists and sources the configurations created
under `auth-keys.xml`.
For most purposes you will not need to edit this file unless you
need to apply a specific, non-default property change during the tests.
### File `auth-keys.xml`
The presence of this file triggers the testing of the S3 classes.
Without this file, *none of the integration tests in this module will be
executed*.
The XML file must contain all the ID/key information needed to connect
each of the filesystem clients to the object stores, and a URL for
each filesystem for its testing.
1. `test.fs.s3n.name` : the URL of the bucket for S3n tests
1. `test.fs.s3a.name` : the URL of the bucket for S3a tests
The contents of each bucket will be destroyed during the test process:
do not use the bucket for any purpose other than testing. Furthermore, for
s3a, all in-progress multi-part uploads to the bucket will be aborted at the
start of a test (by forcing `fs.s3a.multipart.purge=true`) to clean up the
temporary state of previously failed tests.
Example:
```xml
<configuration>
<property>
<name>test.fs.s3n.name</name>
<value>s3n://test-aws-s3n/</value>
</property>
<property>
<name>test.fs.s3a.name</name>
<value>s3a://test-aws-s3a/</value>
</property>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>DONOTPCOMMITTHISKEYTOSCM</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>DONOTEVERSHARETHISSECRETKEY!</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<description>AWS access key ID. Omit for IAM role-based authentication.</description>
<value>DONOTCOMMITTHISKEYTOSCM</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<description>AWS secret key. Omit for IAM role-based authentication.</description>
<value>DONOTEVERSHARETHISSECRETKEY!</value>
</property>
<property>
<name>test.sts.endpoint</name>
<description>Specific endpoint to use for STS requests.</description>
<value>sts.amazonaws.com</value>
</property>
</configuration>
```
### File `contract-test-options.xml`
The file `hadoop-tools/hadoop-aws/src/test/resources/contract-test-options.xml`
must be created and configured for the test filesystems.
If a specific `fs.contract.test.fs.*` test path is not defined for
any of the filesystems, those tests will be skipped.
The standard S3 authentication details must also be provided. This can be
through copy-and-paste of the `auth-keys.xml` credentials, or it can be
through direct XInclude inclusion.
Here is an example `contract-test-options.xml` which places all test options
into the `auth-keys.xml` file, so offering a single place to keep credentials
and define test endpoint bindings.
```xml
<configuration>
<include xmlns="http://www.w3.org/2001/XInclude"
href="auth-keys.xml"/>
</configuration>
```
### s3n://
In the file `src/test/resources/contract-test-options.xml`, the filesystem
name must be defined in the property `fs.contract.test.fs.s3n`.
The standard configuration options to define the S3N authentication details
must also be provided.
Example:
```xml
<property>
<name>fs.contract.test.fs.s3n</name>
<value>s3n://test-aws-s3n/</value>
</property>
```
### s3a://
In the file `src/test/resources/contract-test-options.xml`, the filesystem
name must be defined in the property `fs.contract.test.fs.s3a`.
The standard configuration options to define the S3A authentication details
must also be provided.
Example:
```xml
<property>
<name>fs.contract.test.fs.s3a</name>
<value>s3a://test-aws-s3a/</value>
</property>
```
### Complete example of `contract-test-options.xml`
```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
~ Licensed to the Apache Software Foundation (ASF) under one
~ or more contributor license agreements. See the NOTICE file
~ distributed with this work for additional information
~ regarding copyright ownership. The ASF licenses this file
~ to you under the Apache License, Version 2.0 (the
~ "License"); you may not use this file except in compliance
~ with the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing, software
~ distributed under the License is distributed on an "AS IS" BASIS,
~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~ See the License for the specific language governing permissions and
~ limitations under the License.
-->
<configuration>
<include xmlns="http://www.w3.org/2001/XInclude"
href="auth-keys.xml"/>
<property>
<name>fs.contract.test.fs.s3</name>
<value>s3://test-aws-s3/</value>
</property>
<property>
<name>fs.contract.test.fs.s3a</name>
<value>s3a://test-aws-s3a/</value>
</property>
<property>
<name>fs.contract.test.fs.s3n</name>
<value>s3n://test-aws-s3n/</value>
</property>
</configuration>
```
This example pulls in the `auth-keys.xml` file for the credentials.
This provides one single place to keep the keys up to date —and means
that the file `contract-test-options.xml` does not contain any
secret credentials itself. As the auth keys XML file is kept out of the
source code tree, it is not going to get accidentally committed.
### Configuring S3a Encryption
For S3a encryption tests to run correctly, the
`fs.s3a.server-side-encryption-key` must be configured in the s3a contract xml
file with an AWS KMS encryption key ARN, as this value is different for each
AWS KMS key.
Example:
```xml
<property>
<name>fs.s3a.server-side-encryption-key</name>
<value>arn:aws:kms:us-west-2:360379543683:key/071a86ff-8881-4ba0-9230-95af6d01ca01</value>
</property>
```
You can also force all the tests to run with a specific SSE encryption method
by configuring the property `fs.s3a.server-side-encryption-algorithm` in the s3a
contract file.
## Running the Tests
After completing the configuration, execute the test run through Maven.
```bash
mvn clean verify
```
It's also possible to execute multiple test suites in parallel by passing the
`parallel-tests` property on the command line. The tests spend most of their
time blocked on network I/O with the S3 service, so running in parallel tends to
complete full test runs faster.
```bash
mvn -Dparallel-tests clean verify
```
Some tests must run with exclusive access to the S3 bucket, so even with the
`parallel-tests` property, several test suites will run in serial in a separate
Maven execution step after the parallel tests.
By default, `parallel-tests` runs 4 test suites concurrently. This can be tuned
by passing the `testsThreadCount` property.
```bash
mvn -Dparallel-tests -DtestsThreadCount=8 clean verify
```
To run just unit tests, which do not require S3 connectivity or AWS credentials,
use any of the above invocations, but switch the goal to `test` instead of
`verify`.
```bash
mvn clean test
mvn -Dparallel-tests clean test
mvn -Dparallel-tests -DtestsThreadCount=8 clean test
```
To run only a specific named subset of tests, pass the `test` property for unit
tests or the `it.test` property for integration tests.
```bash
mvn clean test -Dtest=TestS3AInputPolicies
mvn clean verify -Dit.test=ITestS3AFileContextStatistics -Dtest=none
mvn clean verify -Dtest=TestS3A* -Dit.test=ITestS3A*
```
Note that when running a specific subset of tests, the patterns passed in `test`
and `it.test` override the configuration of which tests need to run in isolation
in a separate serial phase (mentioned above). This can cause unpredictable
results, so the recommendation is to avoid passing `parallel-tests` in
combination with `test` or `it.test`. If you know that you are specifying only
tests that can run safely in parallel, then it will work. For wide patterns,
like `ITestS3A*` shown above, it may cause unpredictable test failures.
### Testing against different regions
S3A can connect to different regions —the tests support this. Simply
define the target region in `contract-test-options.xml` or any `auth-keys.xml`
file referenced.
```xml
<property>
<name>fs.s3a.endpoint</name>
<value>s3.eu-central-1.amazonaws.com</value>
</property>
```
This is used for all tests, except for scale tests using a public CSV .gz file
(see below).
### CSV Data source Tests
The `TestS3AInputStreamPerformance` tests require read access to a multi-MB
text file. The default file for these tests is one published by Amazon,
[s3a://landsat-pds.s3.amazonaws.com/scene_list.gz](http://landsat-pds.s3.amazonaws.com/scene_list.gz).
This is a gzipped CSV index of other files which Amazon serves for open use.
The path to this object is set in the option `fs.s3a.scale.test.csvfile`,
```xml
<property>
<name>fs.s3a.scale.test.csvfile</name>
<value>s3a://landsat-pds/scene_list.gz</value>
</property>
```
1. If the option is not overridden, the default value is used. This
is hosted in Amazon's US-east datacenter.
1. If `fs.s3a.scale.test.csvfile` is empty, tests which require it will be skipped.
1. If the data cannot be read for any reason then the test will fail.
1. If the property is set to a different path, then that data must be readable
and "sufficiently" large.
(the reason the space or newline is needed is to add "an empty entry"; an empty
`<value/>` would be considered undefined and pick up the default)
If using a test file in an S3 region requiring a different endpoint value
set in `fs.s3a.endpoint`, a bucket-specific endpoint must be defined.
For the default test dataset, hosted in the `landsat-pds` bucket, this is:
```xml
<property>
<name>fs.s3a.bucket.landsat-pds.endpoint</name>
<value>s3.amazonaws.com</value>
<description>The endpoint for s3a://landsat-pds URLs</description>
</property>
```
### Viewing Integration Test Reports
Integration test results and logs are stored in `target/failsafe-reports/`.
An HTML report can be generated during site generation, or with the `surefire-report`
plugin:
```bash
mvn surefire-report:failsafe-report-only
```
### Scale Tests
There is a set of tests designed to measure the scalability and performance
of the S3A client at scale, the *Scale Tests*. Tests include: creating
and traversing directory trees, uploading large files, renaming them,
deleting them, seeking through the files, performing random IO, and others.
This makes them a foundational part of the benchmarking.
By their very nature they are slow. And, as their execution time is often
limited by bandwidth between the computer running the tests and the S3 endpoint,
parallel execution does not speed these tests up.
#### Enabling the Scale Tests
The tests are enabled if the `scale` property is set in the maven build;
this can be done regardless of whether or not the parallel test profile
is used.
```bash
mvn verify -Dscale
mvn verify -Dparallel-tests -Dscale -DtestsThreadCount=8
```
The most bandwidth intensive tests (those which upload data) always run
sequentially; those which are slow due to HTTPS setup costs or server-side
actions are included in the set of parallelized tests.
#### Maven build tuning options
Some of the tests can be tuned from the maven build or from the
configuration file used to run the tests.
```bash
mvn verify -Dparallel-tests -Dscale -DtestsThreadCount=8 -Dfs.s3a.scale.test.huge.filesize=128M
```
The algorithm is
1. The value is queried from the configuration file, using a default value if
it is not set.
1. The value is queried from the JVM System Properties, where it is passed
down by maven.
1. If the system property is null, an empty string, or it has the value `unset`,
then the configuration value is used. The `unset` option is used to
[work round a quirk in maven property propagation](http://stackoverflow.com/questions/7773134/null-versus-empty-arguments-in-maven).
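
A short Java sketch of that resolution order follows; the class and helper name
here are illustrative only, not the project's actual test utilities.

```java
import org.apache.hadoop.conf.Configuration;

/** Illustrative only: resolve a test option from the configuration file
 *  and an optional -D system property passed down by maven. */
public final class TestPropertyResolution {
  public static final String UNSET = "unset";

  public static String getTestProperty(Configuration conf, String key, String defVal) {
    // 1. configuration file value, falling back to the default
    String fromConf = conf.getTrimmed(key, defVal);
    // 2. JVM system property set via maven's -Dkey=value
    String fromMaven = System.getProperty(key);
    // 3. null, empty or "unset" system properties mean "use the config value"
    if (fromMaven == null || fromMaven.isEmpty() || UNSET.equals(fromMaven)) {
      return fromConf;
    }
    return fromMaven;
  }
}
```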
Only a few properties can be set this way; more will be added.
| Property | Meaning |
|-----------|-------------|
| `fs.s3a.scale.test.timeout`| Timeout in seconds for scale tests |
| `fs.s3a.scale.test.huge.filesize`| Size for huge file uploads |
| `fs.s3a.scale.test.huge.huge.partitionsize`| Size for partitions in huge file uploads |
The file and partition sizes are numeric values with a k/m/g/t/p suffix depending
on the desired size. For example: 128M, 128m, 2G, 4T, or even 1P.
#### Scale test configuration options
Some scale tests perform multiple operations (such as creating many directories).
The exact number of operations to perform is configurable in the option
`scale.test.operation.count`
```xml
<property>
<name>scale.test.operation.count</name>
<value>10</value>
</property>
```
Larger values generate more load, and are recommended when testing locally,
or in batch runs.
Smaller values result in faster test runs, especially when the object
store is a long way away.
Operations which work on directories have a separate option: this controls
the width and depth of tests creating recursive directories. Larger
values create exponentially more directories, with consequent performance
impact.
```xml
<property>
<name>scale.test.directory.count</name>
<value>2</value>
</property>
```
DistCp tests targeting S3A support a configurable file size. The default is
10 MB, but the configuration value is expressed in KB so that it can be tuned
smaller to achieve faster test runs.
```xml
<property>
<name>scale.test.distcp.file.size.kb</name>
<value>10240</value>
</property>
```
S3A specific scale test properties are
##### `fs.s3a.scale.test.huge.filesize`: size in MB for "Huge file tests".
The Huge File tests validate S3A's ability to handle large files —the property
`fs.s3a.scale.test.huge.filesize` declares the file size to use.
```xml
<property>
<name>fs.s3a.scale.test.huge.filesize</name>
<value>200M</value>
</property>
```
Amazon S3 handles files larger than 5GB differently than smaller ones.
Setting the huge filesize to a number greater than that validates support
for huge files.
```xml
<property>
<name>fs.s3a.scale.test.huge.filesize</name>
<value>6G</value>
</property>
```
Tests at this scale are slow: they are best executed from hosts running in
the cloud infrastructure where the S3 endpoint is based.
Otherwise, set a large timeout in `fs.s3a.scale.test.timeout`
```xml
<property>
<name>fs.s3a.scale.test.timeout</name>
<value>432000</value>
</property>
```
The tests are executed in an order such that created files are only cleaned up
after all the tests have finished. If the tests are interrupted, the test data will remain.
## Testing against non-AWS S3 endpoints
The S3A filesystem is designed to work with storage endpoints which implement
the S3 protocols to the extent that the Amazon S3 SDK is capable of talking
to it. We encourage testing against other filesystems and submissions of patches
which address issues. In particular, we encourage testing of Hadoop release
candidates, as these third-party endpoints get even less testing than the
S3 endpoint itself.
### Disabling the encryption tests
If the endpoint doesn't support server-side-encryption, these will fail. They
can be turned off.
```xml
<property>
<name>test.fs.s3a.encryption.enabled</name>
<value>false</value>
</property>
```
Encryption is only used for those specific test suites with `Encryption` in
their classname.
### Configuring the CSV file read tests
To test on alternate infrastructures supporting
the same APIs, the option `fs.s3a.scale.test.csvfile` must either be
set to " " (a single space), or an object of at least 10MB must be uploaded
to the object store and the `fs.s3a.scale.test.csvfile` option set to its path.
```xml
<property>
<name>fs.s3a.scale.test.csvfile</name>
<value> </value>
</property>
```
(yes, the space is necessary. The Hadoop `Configuration` class treats an empty
value as "do not override the default").
### Testing Session Credentials
The test `TestS3ATemporaryCredentials` requests a set of temporary
credentials from the STS service, then uses them to authenticate with S3.
If an S3 implementation does not support STS, then the functional test
cases must be disabled:
```xml
<property>
<name>test.fs.s3a.sts.enabled</name>
<value>false</value>
</property>
```
These tests request a temporary set of credentials from the STS service endpoint.
An alternate endpoint may be defined in `test.fs.s3a.sts.endpoint`.
```xml
<property>
<name>test.fs.s3a.sts.endpoint</name>
<value>https://sts.example.org/</value>
</property>
```
The default is "", meaning "use the Amazon default value".
## Debugging Test failures
Logging at debug level is the standard way to provide more diagnostics output;
after setting this, rerun the tests:
```properties
log4j.logger.org.apache.hadoop.fs.s3a=DEBUG
```
There are also some logging options for debug logging of the AWS client:
```properties
log4j.logger.com.amazonaws=DEBUG
log4j.logger.com.amazonaws.http.conn.ssl=INFO
log4j.logger.com.amazonaws.internal=INFO
```
There is also the option of enabling logging on a bucket; this could perhaps
be used to diagnose problems from that end. This isn't something actively
used, but remains an option. If you are forced to debug this way, consider
setting the `fs.s3a.user.agent.prefix` to a unique prefix for a specific
test run, which will enable the specific log entries to be more easily
located.
## Adding new tests
New tests are always welcome. Bear in mind that we need to keep costs
and test time down, which is done by
* Not duplicating tests.
* Being efficient in your use of Hadoop API calls.
* Isolating large/slow tests into the "scale" test group.
* Designing all tests to execute in parallel (where possible).
* Adding new probes and predicates into existing tests, albeit carefully.
*No duplication*: if an operation is tested elsewhere, don't repeat it. This
applies as much for metadata operations as it does for bulk IO. If a new
test case is added which completely obsoletes an existing test, it is OK
to cut the previous one —after showing that coverage is not worsened.
*Efficient*: prefer calling `getFileStatus()` and examining the results, rather than
calls to `exists()`, `isFile()`, etc.
*Isolating Scale tests*. Any S3A test doing large amounts of IO MUST extend the
class `S3AScaleTestBase`, so only running if `scale` is defined on a build,
supporting test timeouts configurable by the user. Scale tests should also
support configurability as to the actual size of objects/number of operations,
so that behavior at different scale can be verified.
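
As a sketch of what such a configurable scale test might look like (the class
name is invented for illustration, and the `getFileSystem()` helper is assumed
to be available on `S3AScaleTestBase`; check the base class before copying):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.contract.ContractTestUtils;
import org.apache.hadoop.fs.s3a.scale.S3AScaleTestBase;
import org.junit.Test;

public class ITestS3AManyFilesExample extends S3AScaleTestBase {

  @Test
  public void testCreateManyFiles() throws Exception {
    // the operation count is configurable, so the same test can run small
    // over a long-haul link or large against a nearby endpoint
    long count = getFileSystem().getConf()
        .getLong("scale.test.operation.count", 10);
    Path base = new Path("/ITestS3AManyFilesExample");
    for (long i = 0; i < count; i++) {
      ContractTestUtils.touch(getFileSystem(), new Path(base, "file-" + i));
    }
    getFileSystem().delete(base, true);
  }
}
```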
*Designed for parallel execution*. A key need here is for each test suite to work
on isolated parts of the filesystem. Subclasses of `AbstractS3ATestBase`
SHOULD use the `path()` method, with a base path of the test suite name, to
build isolated paths. Tests MUST NOT assume that they have exclusive access
to a bucket.
*Extending existing tests where appropriate*. This recommendation goes
against normal testing best practice of "test one thing per method".
Because it is so slow to create directory trees or upload large files, we do
not have that luxury. All the tests against real S3 endpoints are integration
tests where sharing test setup and teardown saves time and money.
A standard way to do this is to extend existing tests with some extra predicates,
rather than write new tests. When doing this, make sure that the new predicates
fail with meaningful diagnostics, so any new problems can be easily debugged
from test logs.
### Requirements of new Tests
This is what we expect from new tests; they're an extension of the normal
Hadoop requirements, based on the need to work with remote servers whose
use requires the presence of secret credentials, where tests may be slow,
and where finding out why something failed from nothing but the test output
is critical.
#### Subclasses Existing Shared Base Classes
Extend `AbstractS3ATestBase` or `AbstractSTestS3AHugeFiles` unless justifiable.
These set things up for testing against the object stores, provide good threadnames,
help generate isolated paths, and for `AbstractSTestS3AHugeFiles` subclasses,
only run if `-Dscale` is set.
Key features of `AbstractS3ATestBase`
* `getFileSystem()` returns the S3A Filesystem bonded to the contract test Filesystem
defined in `fs.contract.test.fs.s3a`
* will automatically skip all tests if that URL is unset.
* Extends `AbstractFSContractTestBase` and `Assert` for all their methods.
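
A minimal suite built on the shared base class might look like the following
sketch (the class name is invented; `path()` and `getFileSystem()` are the
helpers listed above):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.contract.ContractTestUtils;
import org.apache.hadoop.fs.s3a.AbstractS3ATestBase;
import org.junit.Test;

public class ITestS3AExampleSuite extends AbstractS3ATestBase {

  @Test
  public void testCreateAndList() throws Exception {
    // path() keeps this suite in its own part of the bucket, so it can
    // safely run in parallel with other suites
    Path dir = path("testCreateAndList");
    ContractTestUtils.touch(getFileSystem(), new Path(dir, "child"));
    assertEquals("wrong number of entries under " + dir,
        1, getFileSystem().listStatus(dir).length);
  }
}
```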
Having shared base classes may help reduce future maintenance too. Please
use them.
#### Secure
Don't ever log credentials. The credential tests go out of their way to
not provide meaningful logs or assertion messages precisely to avoid this.
#### Efficient in Time and Money
This means efficient in test setup/teardown, and, ideally, making use of
existing public datasets to save setup time and tester cost.
Strategies of particular note are:
1. `ITestS3ADirectoryPerformance`: a single test case sets up the directory
tree then performs different list operations, measuring the time taken.
1. `AbstractSTestS3AHugeFiles`: marks the test suite as
`@FixMethodOrder(MethodSorters.NAME_ASCENDING)` then orders the test cases such
that each test case expects the previous test to have completed (here: uploaded a file,
renamed a file, ...). This provides for independent tests in the reports, yet still
permits an ordered sequence of operations. Do note the use of `Assume.assume()`
to detect when the preconditions for a single test case are not met, hence,
the tests become skipped, rather than fail with a trace which is really a false alarm.
The ordered test case mechanism of `AbstractSTestS3AHugeFiles` is probably
the most elegant way of chaining test setup/teardown.
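
A compressed sketch of that pattern (the class name is invented; the real
huge-files suites are considerably more thorough):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.contract.ContractTestUtils;
import org.apache.hadoop.fs.s3a.AbstractS3ATestBase;
import org.junit.Assume;
import org.junit.FixMethodOrder;
import org.junit.Test;
import org.junit.runners.MethodSorters;

@FixMethodOrder(MethodSorters.NAME_ASCENDING)
public class ITestS3AOrderedExample extends AbstractS3ATestBase {

  // static so the state survives across the per-test-case instances
  private static Path created;

  @Test
  public void test_010_create() throws Exception {
    created = path("ordered/source");
    ContractTestUtils.touch(getFileSystem(), created);
  }

  @Test
  public void test_020_rename() throws Exception {
    // if the previous step did not run, skip rather than fail misleadingly
    Assume.assumeTrue("no file was created", created != null);
    assertTrue("rename returned false",
        getFileSystem().rename(created, path("ordered/renamed")));
  }
}
```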
Regarding reusing existing data, we tend to use the landsat archive of
AWS US-East for our testing of input stream operations. This doesn't work
against other regions, or with third party S3 implementations. Thus the
URL can be overridden for testing elsewhere.
#### Works With Other S3 Endpoints
Don't assume AWS S3 US-East only, do allow for working with external S3 implementations.
Those may be behind the latest S3 API features, not support encryption, session
APIs, etc.
#### Works Over Long-haul Links
As well as making file size and operation counts scalable, this includes
making test timeouts adequate. The Scale tests make this configurable; it's
hard coded to ten minutes in `AbstractS3ATestBase()`; subclasses can
change this by overriding `getTestTimeoutMillis()`.
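
For example, a suite known to be slow over long-haul links might lengthen its
timeout as in this sketch (class name invented; verify the exact name and
visibility of the timeout hook in the current base class):

```java
import org.apache.hadoop.fs.s3a.AbstractS3ATestBase;

public class ITestS3ASlowLinkExample extends AbstractS3ATestBase {

  @Override
  protected int getTestTimeoutMillis() {
    // thirty minutes instead of the ten minute default
    return 30 * 60 * 1000;
  }

  // ... test cases elided ...
}
```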
Equally importantly: support proxies, as some testers need them.
#### Provides Diagnostics and timing information
1. Give threads useful names.
1. Create logs, log things. Know that the `S3AFileSystem` and its input
and output streams *all* provide useful statistics in their `toString()`
calls; logging them is useful on its own.
1. You can use `AbstractS3ATestBase.describe(format-string, args)` here; it
adds some newlines so as to be easier to spot.
1. Use `ContractTestUtils.NanoTimer` to measure the duration of operations,
and log the output.
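
Putting these together in a sketch (the class name is invented; verify the
`describe()` and `NanoTimer` signatures against the current code before
relying on them):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.contract.ContractTestUtils;
import org.apache.hadoop.fs.s3a.AbstractS3ATestBase;
import org.junit.Test;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ITestS3ATimedExample extends AbstractS3ATestBase {
  private static final Logger LOG =
      LoggerFactory.getLogger(ITestS3ATimedExample.class);

  @Test
  public void testTimedCreate() throws Exception {
    describe("creating a file and timing the operation");
    Path file = path("timed/file");
    ContractTestUtils.NanoTimer timer = new ContractTestUtils.NanoTimer();
    ContractTestUtils.touch(getFileSystem(), file);
    timer.end("time to create %s", file);
    // the filesystem's toString() includes its statistics
    LOG.info("filesystem state: {}", getFileSystem());
  }
}
```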
#### Fails Meaningfully
The `ContractTestUtils` class contains a whole set of assertions for making
statements about the expected state of a filesystem, e.g.
`assertPathExists(FS, path)`, `assertPathDoesNotExists(FS, path)`, and others.
These do their best to provide meaningful diagnostics on failures (e.g. directory
listings, file status, ...), so help make failures easier to understand.
At the very least, do not use `assertTrue()` or `assertFalse()` without
including error messages.
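
A sketch of the difference (invented test name; `assertPathExists()` is assumed
to take the filesystem, a message, and the path, as described above):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.contract.ContractTestUtils;
import org.apache.hadoop.fs.s3a.AbstractS3ATestBase;
import org.junit.Test;

public class ITestS3ADiagnosticsExample extends AbstractS3ATestBase {

  @Test
  public void testRenameCreatesDestination() throws Exception {
    Path src = path("diagnostics/source");
    Path dest = path("diagnostics/dest");
    ContractTestUtils.touch(getFileSystem(), src);
    // not a bare assertTrue(renamed): say what was renamed and to where
    assertTrue("rename(" + src + ", " + dest + ") returned false",
        getFileSystem().rename(src, dest));
    // the contract assertion adds filesystem diagnostics to the failure text
    ContractTestUtils.assertPathExists(getFileSystem(),
        "rename destination missing", dest);
  }
}
```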
#### Cleans Up Afterwards
Keeps costs down.
1. Do not only clean up if a test case completes successfully; test suite
teardown must do it.
1. That teardown code must check for the filesystem and other fields being
null before the cleanup. Why? If test setup fails, the teardown methods still
get called.
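
A sketch of the pattern; the base class may already clean up the test
directory, so treat this purely as an illustration of the null checks:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.s3a.AbstractS3ATestBase;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class ITestS3ACleanupExample extends AbstractS3ATestBase {

  private Path workDir;

  @Before
  public void createWorkDir() throws Exception {
    workDir = path("cleanup-example");
    getFileSystem().mkdirs(workDir);
  }

  @After
  public void cleanupWorkDir() throws Exception {
    // runs even when setup or the test case failed, so guard against
    // fields which were never assigned
    if (workDir != null && getFileSystem() != null) {
      getFileSystem().delete(workDir, true);
    }
  }

  @Test
  public void testSomethingInWorkDir() throws Exception {
    assertTrue("mkdirs under " + workDir + " failed",
        getFileSystem().mkdirs(new Path(workDir, "child")));
  }
}
```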
#### Works Reliably
We really appreciate this &mdash; you will too.
## Testing s3://
The test configuration must declare the test bucket in `test.fs.s3.name`,
the credentials for the s3:// filesystem, and the contract test bucket in
`fs.contract.test.fs.s3`.
### s3://
The filesystem name must be defined in the property `fs.contract.test.fs.s3`.
The same bucket name can be used for all tests.
Example:
```xml
<property>
<name>test.fs.s3.name</name>
<value>s3://test-aws-s3/</value>
</property>
<property>
<name>fs.contract.test.fs.s3</name>
<value>${test.fs.s3.name}</value>
</property>
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>DONOTPCOMMITTHISKEYTOSCM</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>DONOTEVERSHARETHISSECRETKEY!</value>
</property>
```
## Tips
### How to keep your credentials really safe
Although the `auth-keys.xml` file is marked as ignored in git and subversion,
it is still in your source tree, and there's always the risk that it may
creep out.
You can avoid this by keeping your keys outside the source tree and
using an absolute XInclude reference to it.
```xml
<configuration>
<include xmlns="http://www.w3.org/2001/XInclude"
href="file:///users/ubuntu/.auth-keys.xml" />
</configuration>
```