Commit Graph

126 Commits

Author SHA1 Message Date
Armin Braun 3e2dfc6eac
Remove GCS Bucket Exists Check (#60899) (#60914)
Same as https://github.com/elastic/elasticsearch/pull/43288 for GCS.
We don't need to do the bucket exists check before using the repo, that just needlessly
increases the necessary permissions for using the GCS repository.
2020-08-11 09:54:27 +02:00
Armin Braun 7ae9dc2092
Unify Stream Copy Buffer Usage (#56078) (#60608)
We have various ways of copying between two streams and handling thread-local
buffers throughout the codebase. This commit unifies a number of them and
removes buffer allocations in many spots.
2020-08-04 09:54:52 +02:00
Jake Landis 96b7122917
[7.x] Convert repository-* from integTest to [yaml | java]RestTest or internalClusterTest (#60085) (#60404)
For OSS plugins that being with repository-*, integTest
task is now a no-op and all of the tests are now executed via a test,
yamlRestTest, javaRestTest, or internalClusterTest.

related: #56841
related: #59444
2020-07-29 11:19:44 -05:00
Armin Braun ebb6677815
Formalize and Streamline Buffer Sizes used by Repositories (#59771) (#60051)
Due to complicated access checks (reads and writes execute in their own access context) on some repositories (GCS, Azure, HDFS), using a hard coded buffer size of 4k for restores was needlessly inefficient.
By the same token, the use of stream copying with the default 8k buffer size  for blob writes was inefficient as well.

We also had dedicated, undocumented buffer size settings for HDFS and FS repositories. For these two we would use a 100k buffer by default. We did not have such a setting for e.g. GCS though, which would only use an 8k read buffer which is needlessly small for reading from a raw `URLConnection`.

This commit adds an undocumented setting that sets the default buffer size to `128k` for all repositories. It removes wasteful allocation of such a large buffer for small writes and reads in case of HDFS and FS repositories (i.e. still using the smaller buffer to write metadata) but uses a large buffer for doing restores and uploading segment blobs.

This should speed up Azure and GCS restores and snapshots in a non-trivial way as well as save some memory when reading small blobs on FS and HFDS repositories.
2020-07-22 21:06:31 +02:00
Armin Braun d18b434e62
Remove Artificially Low Chunk Size Limits from GCS + Azure Blob Stores (#59279) (#59564)
Removing these limits as they cause unnecessarily many object in the blob stores.
We do not have to worry about BwC of this change since we do not support any 3rd party
implementations of Azure or GCS.
Also, since there is no valid reason to set a different than the default maximum chunk size at this
point, removing the documentation (which was incorrect in the case of Azure to begin with) for the setting
from the docs.

Closes #56018
2020-07-14 22:31:07 +02:00
Armin Braun 9268b25789
Add Check for Metadata Existence in BlobStoreRepository (#59141) (#59216)
In order to ensure that we do not write a broken piece of `RepositoryData`
because the phyiscal repository generation was moved ahead more than one step
by erroneous concurrent writing to a repository we must check whether or not
the current assumed repository generation exists in the repository physically.
Without this check we run the risk of writing on top of stale cached repository data.

Relates #56911
2020-07-08 14:25:01 +02:00
Yannick Welsch 15c85b29fd
Account for recovery throttling when restoring snapshot (#58658) (#58811)
Restoring from a snapshot (which is a particular form of recovery) does not currently take recovery throttling into account
(i.e. the `indices.recovery.max_bytes_per_sec` setting). While restores are subject to their own throttling (repository
setting `max_restore_bytes_per_sec`), this repository setting does not allow for values to be configured differently on a
per-node basis. As restores are very similar in nature to peer recoveries (streaming bytes to the node), it makes sense to
configure throttling in a single place.

The `max_restore_bytes_per_sec` setting is also changed to default to unlimited now, whereas previously it was set to
`40mb`, which is the current default of `indices.recovery.max_bytes_per_sec`). This means that no behavioral change
will be observed by clusters where the recovery and restore settings were not adapted.

Relates https://github.com/elastic/elasticsearch/issues/57023

Co-authored-by: James Rodewig <james.rodewig@elastic.co>
2020-07-01 12:19:29 +02:00
markharwood e2c0c4197f Mute GoogleCloudStorageRepositoryClientYamlTestSuiteIT
For #57115
2020-06-03 13:25:31 +01:00
Tanguy Leroux b4a2cd810a
Use 3rd party task to run integration tests on external service (#56588)
Backport of #56587 for 7.x
2020-06-02 11:26:58 +02:00
Armin Braun be6fa72432
Fix GCS Mock Behavior for Missing Bucket (#57283) (#57310)
* Fix GCS Mock Behavior for Missing Bucket

We were throwing a 500 instead of a 404 for a missing bucket.
This would make yaml tests needlessly wait for multiple seconds, retrying
the 500 response with backoff, in the test checking behavior for missing buckets.
2020-05-29 10:01:20 +02:00
Francisco Fernández Castaño 60c7832141
Track upload requests on S3 repositories (#56904)
Add tracking for regular and multipart uploads.
Regular uploads are categorized as PUT.
Multi part uploads are categorized as POST.
The number of documents created for the test #testRequestStats
have been increased so all upload methods are exercised.

Backport of #56826
2020-05-18 19:05:17 +02:00
Francisco Fernández Castaño 8ab9fc10c1
Track multipart/resumable uploads GCS API calls (#56892)
Add tracking for multipart and resumable uploads for GoogleCloudStorage.
For resumable uploads only the last request is taken into account for
billing, so that's the only request that's tracked.

Backport of #56821
2020-05-18 13:39:26 +02:00
Francisco Fernández Castaño 97bf47f5b9
Track GET/LIST GoogleCloudStorage API calls (#56758)
Backporting #56585 to 7.x branch.

Adds tracking for the API calls performed by the GoogleCloudStorage
underlying SDK. It hooks an HttpResponseInterceptor to the SDK
transport layer and does http request filtering based on the URI
paths that we are interested to track. Unfortunately we cannot hook
a wrapper into the ServiceRPC interface since we're using different
levels of abstraction to implement retries during reads
(GoogleCloudStorageRetryingInputStream).
2020-05-14 14:03:21 +02:00
Yannick Welsch ba39c261e8 Use streaming reads for GCS (#55506)
To read from GCS repositories we're currently using Google SDK's official BlobReadChannel,
which issues a new request every 2MB (default chunk size for BlobReadChannel) using range
requests, and fully downloads the chunk before exposing it to the returned InputStream. This
means that the SDK issues an awfully high number of requests to download large blobs.
Increasing the chunk size is not an option, as that will mean that an awfully high amount of
heap memory will be consumed by the download process.

The Google SDK does not provide the right abstractions for a streaming download. This PR
uses the lower-level primitives of the SDK to implement a streaming download, similar to what
S3's SDK does.

Also closes #55505
2020-04-21 13:22:26 +02:00
Ignacio Vera 4783f1894c
mute test testReadRangeBlobWithRetries (#55507) (#55508) 2020-04-21 10:59:35 +02:00
Yannick Welsch b9da307cd1 Add GCS support for searchable snapshots (#55403)
Adds ranged read support for GCS repositories in order to enable searchable snapshot support
for GCS.

As part of this PR, I've extracted some of the test infrastructure to make sure that
GoogleCloudStorageBlobContainerRetriesTests and S3BlobContainerRetriesTests are covering
similar test (as I saw those diverging in what they cover)
2020-04-20 13:02:59 +02:00
Jason Tedor 0a1b566c65
Fix security manager bug writing large blobs to GCS (#55421)
* Fix security manager bug writing large blobs to GCS

This commit addresses a security manager permissions issue writing large
blobs (on the resumable upload path) to GCS. The underlying issue here
is that we need to wrap the close and write calls on the channel. It is
not enough to do this:

SocketAccess.doPrivilegedVoidIOException(
  () -> Streams.copy(
    inputStream,
    Channels.newOutputStream(client().writer(blobInfo, writeOptions))));

This reason that this is not enough is because Streams#copy will be in
the stacktrace and it is not granted the security manager permissions
needed to close or write this channel. We only grant those permissions
to classes loaded in the plugin classloader, and Streams#copy is from
the parent classloader. This is why we must wrap the close and write
calls as privileged, to truncate the Streams#copy call out of the
stacktrace.

The reason that this issue is not caught in testing is because the size
of data that we use in testing is too small to trigger the large blob
resumable upload path. Therefore, we address this by adding a system
property to control the threshold, which we can then set in tests to
exercise this code path. Prior to rewriting the writeBlobResumable
method to wrap the close and write calls as privileged, with this
additional test, we are able to reproduce the security manager
permissions issue. After adding the wrapping, this test now passes.

* Fix forbidden APIs issue

* Remove leftover debugging
2020-04-17 18:49:10 -04:00
Armin Braun 73ab3719e8
Mute GCS Retry Tests on JDK8 (#55372)
Same as #53119 but for the retries tests.
Closes #55317
2020-04-17 12:19:35 +02:00
Jason Tedor 5fcda57b37
Rename MetaData to Metadata in all of the places (#54519)
This is a simple naming change PR, to fix the fact that "metadata" is a
single English word, and for too long we have not followed general
naming conventions for it. We are also not consistent about it, for
example, METADATA instead of META_DATA if we were trying to be
consistent with MetaData (although METADATA is correct when considered
in the context of "metadata"). This was a simple find and replace across
the code base, only taking a few minutes to fix this naming issue
forever.
2020-03-31 17:24:38 -04:00
Armin Braun 204c366a4e
Upgrade GCS SDK to 1.104.0 (#52839) (#53152)
Upgrading the GCS SDK to the most recent version.
Adjusting (i.e. improving) the REST mock accordingly.
This should significantly boost performance by pulling in
https://github.com/googleapis/java-core/issues/86 in some cases.
2020-03-05 11:18:18 +01:00
Tanguy Leroux 52d4807f8d
Mute GoogleCloudStorageBlobStoreRepositoryTests on jdk8 (#53119)
Tests in GoogleCloudStorageBlobStoreRepositoryTests are known 
to be flaky on JDK 8 (#51446, #52430 ) and we suspect a JDK 
bug (https://bugs.openjdk.java.net/browse/JDK-8180754) that triggers
 some assertion on the server side logic that emulates the Google 
Cloud Storage service.

Sadly we were not able to reproduce the failures, even when using 
the same OS (Debian 9, Ubuntu 16.04) and JDK (Oracle Corporation 
1.8.0_241 [Java HotSpot(TM) 64-Bit Server VM 25.241-b07]) of 
almost all the test failures on CI. While we spent some time fixing 
code (#51933, #52431) to circumvent the JDK bug they are still flaky 
on JDK-8. This commit mute these tests for JDK-8 only.

Close ##52906
2020-03-05 09:18:05 +01:00
Lee Hinman a47e404732 Mute GoogleCloudStorageBlobStoreRepositoryTests (#52926)
These intermittently fail due to an assertion triggered by a JDK bug.

Relates to #52906
2020-02-27 15:16:48 -07:00
Armin Braun 5a7db0c520
Fix GCS Test testReadLargeBlobWithRetries (#52619) (#52624)
The countdown didn't work well here because it only returns `true` once the countdown reaches `0`
but can on subsequent executions return `false` again if a countdown at `0` is counted down again,
leading to more than the expected number of simulated failures.

Closes #52607
2020-02-21 10:34:53 +01:00
Mark Vieira 4bce9984e6
Mute GoogleCloudStorageBlobContainerRetriesTests.testReadLargeBlobWithRetries
Signed-off-by: Mark Vieira <portugee@gmail.com>
2020-02-20 15:13:34 -08:00
Armin Braun aeb7b777e6
Add Blob Download Retries to GCS Repository (#52479) (#52521)
* Add Blob Download Retries to GCS Repository

Exactly as #46589 (and kept as close to it as possible code wise so we can dry things up in a follow-up potentially) but for GCS.

Closes #52319
2020-02-19 18:29:13 +01:00
Armin Braun a9c7557ac4
Fix Failure to Drain Stream in GCS Repo Tests (#52431) (#52454)
Same as #51933 but for the custom handler just used in this test.

Closes #52430
2020-02-18 11:37:34 +01:00
Armin Braun 74e3694234
Optimize GCS Repo Uploads (#51596) (#51618)
For small uploads (that can still be up to 5MB!) we needlessly
reading the `InputStream` into a BAOS which entailed allocating
the `byte[]` for the stream contents twice (because to `toByteArray` on the BAOS copies).

Also, for resumeable uploads we were needlessly wrapping the output channel and running each individual write in its own privileged context when we could just wrap the whole upload in a single privileged context.

Relates #51593
2020-01-29 16:07:30 +01:00
Armin Braun 7914c1a734
Optimize GCS Mock (#51593) (#51594)
This test was still very GC heavy in Java 8 runs in particular
which seems to slow down request processing to the point of timeouts
in some runs.
This PR completely removes the large number of O(MB) `byte[]` allocations
that were happening in the mock http handler which cuts the allocation rate
by about a factor of 5 in my local testing for the GC heavy `testSnapshotWithLargeSegmentFiles`
run.

Closes #51446
Closes #50754
2020-01-29 11:06:05 +01:00
Armin Braun 4a7e09f624
Enforce Logging of Errors in GCS Rest RetriesTests (#50761) (#50783)
It's impossible to tell why #50754 fails without this change.
We're failing to close the `exchange` somewhere and there is no
write timeout in the GCS SDK (something to look into separately)
only a read timeout on the socket so if we're failing on an assertion without
reading the full request body (at least into the read-buffer) we're locking up
waiting forever on `write0`.

This change ensure the `exchange` is closed in the tests where we could lock up
on a write and logs the failure so we can find out what broke #50754.
2020-01-09 10:46:07 +01:00
Armin Braun 761d6e8e4b
Remove BlobContainer Tests against Mocks (#50194) (#50220)
* Remove BlobContainer Tests against Mocks

Removing all these weird mocks as asked for by #30424.
All these tests are now part of real repository ITs and otherwise left unchanged if they had
independent tests that didn't call the `createBlobStore` method previously.
The HDFS tests also get added coverage as a side-effect because they did not have an implementation
of the abstract repository ITs.

Closes #30424
2019-12-16 11:37:09 +01:00
Armin Braun 6eee41e253
Remove Unused Single Delete in BlobStoreRepository (#50024) (#50123)
* Remove Unused Single Delete in BlobStoreRepository

There are no more production uses of the non-bulk delete or the delete that throws
on missing so this commit removes both these methods.
Only the bulk delete logic remains. Where the bulk delete was derived from single deletes,
the single delete code was inlined into the bulk delete method.
Where single delete was used in tests it was replaced by bulk deleting.
2019-12-12 11:17:46 +01:00
Armin Braun d19c8db4e4
Fix GCS Mock Batch Delete Behavior (#50034) (#50084)
Batch deletes get a response for every delete request, not just those that actually hit an existing blob.
The fact that we only responded for existing blobs leads to a degenerate response that throws a parse exception if a batch delete only contains non-existant blobs.
2019-12-11 17:40:25 +01:00
Armin Braun 813b49adb4
Make BlobStoreRepository Aware of ClusterState (#49639) (#49711)
* Make BlobStoreRepository Aware of ClusterState (#49639)

This is a preliminary to #49060.

It does not introduce any substantial behavior change to how the blob store repository
operates. What it does is to add all the infrastructure changes around passing the cluster service to the blob store, associated test changes and a best effort approach to tracking the latest repository generation on all nodes from cluster state updates. This brings a slight improvement to the consistency
by which non-master nodes (or master directly after a failover) will be able to determine the latest repository generation. It does not however do any tricky checks for the situation after a repository operation
(create, delete or cleanup) that could theoretically be used to get even greater accuracy to keep this change simple.
This change does not in any way alter the behavior of the blobstore repository other than adding a better "guess" for the value of the latest repo generation and is mainly intended to isolate the actual logical change to how the
repository operates in #49060
2019-11-29 14:57:47 +01:00
Armin Braun 3862400270
Remove Redundant EsBlobStoreTestCase (#49603) (#49605)
All the implementations of `EsBlobStoreTestCase` use the exact same
bootstrap code that is also used by their implementation of
`EsBlobStoreContainerTestCase`.
This means all tests might as well live under `EsBlobStoreContainerTestCase`
saving a lot of code duplication. Also, there was no HDFS implementation for
`EsBlobStoreTestCase` which is now automatically resolved by moving the tests over
since there is a HDFS implementation for the container tests.
2019-11-26 20:57:19 +01:00
Armin Braun 495b543e63
Improve Stability of GCS Mock API (#49592) (#49597)
Same as #49518 pretty much but for GCS.
Fixing a few more spots where input stream can get closed
without being fully drained and adding assertions to make sure
it's always drained.
Moved the no-close stream wrapper to production code utilities since
there's a number of spots in production code where it's also useful
(will reuse it there in a follow-up).
2019-11-26 16:53:51 +01:00
Tanguy Leroux f753fa2265 HttpHandlers should return correct list of objects (#49283)
This commit fixes the server side logic of "List Objects" operations
of Azure and S3 fixtures. Until today, the fixtures were returning a "
flat" view of stored objects and were not correctly handling the
delimiter parameter. This causes some objects listing to be wrongly
interpreted by the snapshot deletion logic in Elasticsearch which
relies on the ability to list child containers of BlobContainer (#42653)
to correctly delete stale indices.

As a consequence, the blobs were not correctly deleted from the
 emulated storage service and stayed in heap until they got garbage
collected, causing CI failures like #48978.

This commit fixes the server side logic of Azure and S3 fixture when
listing objects so that it now return correct common blob prefixes as
expected by the snapshot deletion process. It also adds an after-test
check to ensure that tests leave the repository empty (besides the
root index files).

Closes #48978
2019-11-20 09:26:42 +01:00
Tanguy Leroux 8a14ea5567
Add docker-composed based test fixture for GCS (#48902)
Similarly to what has be done for Azure in #48636, this commit
adds a new :test:fixtures:gcs-fixture project which provides two
docker-compose based fixtures that emulate a Google Cloud
Storage service.

Some code has been extracted from existing tests and placed
into this new project so that it can be easily reused in other
projects.
2019-11-07 13:27:22 -05:00
Andrey Ershov 088988bb37
GCS snapshot cleanup tool backport to 7.x (#48750)
This is the backport of #45076 with dependent changes.
2019-10-31 18:21:36 +03:00
Tanguy Leroux e1dd0e753d Differentiate service account tokens in GCS tests (#48382)
This commit changes the test so that each node use a specific 
service account and private key. It also changes how unique 
request ids are generated for refresh token request using the 
token itself, so that error count will be specific per node (each 
node should execute a single refresh token request as tokens 
are valid for 1 hour).
2019-10-23 16:57:35 +02:00
Tanguy Leroux 6986d7f968 Add blob container retries tests for Google Cloud Storage (#46968)
Similarly to what has been done for S3 in #45383, this commit 
adds unit tests that verify the behavior of the SDK client and 
blob container implementation for Google Storage when the 
remote service returns errors.

The main purpose was to add an extra test to the specific retry 
logic for 410-Gone errors added in #45963.

Relates #45963
2019-09-24 08:58:24 +02:00
Tanguy Leroux add7148f3b GCS deleteBlobsIgnoringIfNotExists should catch StorageException (#46832)
GoogleCloudStorageBlobStore.deleteBlobsIgnoringIfNotExists() does
not correctly catch StorageException thrown by batch.submit().

In the case a snapshot is deleted through BlobStoreRepository.deleteSnapshot()
a storage exception is not caught (only IOException are) so the deletion is
interrupted and indices cannot be cleaned up. The storage exception bubbles
up to SnapshotService.deleteSnapshotFromRepository() but the listener that
 removes the deletion from the cluster state is not executed, leaving the
deletion in the cluster state.

This bug has been reported in #46772 where batch.submit() threw an
exception in the test testIndicesDeletedFromRepository and following
tests failed because a snapshot deletion was running.

Relates #46772
2019-09-20 10:02:23 +02:00
Tanguy Leroux 3ae51f25dd Move testSnapshotWithLargeSegmentFiles to ESMockAPIBasedRepositoryIntegTestCase (#46802)
This commit moves the common test testSnapshotWithLargeSegmentFiles 
to the ESMockAPIBasedRepositoryIntegTestCase base class.
2019-09-18 15:41:30 +02:00
Tanguy Leroux 4db37801d0 Add resumable uploads support to GCS repository integration tests (#46562)
This commit adds support for resumable uploads to the internal HTTP 
server used in GoogleCloudStorageBlobStoreRepositoryTests. This 
way we can also test the behavior of the Google's client when the 
service returns server errors in response to resumable upload requests.

The BlobStore implementation for GCS has the choice between 2 
methods to upload a blob: resumable and multipart. In the current 
implementation, the client executes a resumable upload if the blob 
size is larger than LARGE_BLOB_THRESHOLD_BYTE_SIZE, 
otherwise it executes a multipart upload. This commit makes this 
logic overridable in tests, allowing to randomize the decision of 
using one method or the other.

The commit add support for single request resumable uploads 
and chunked resumable uploads (the blob is uploaded into multiple 
2Mb chunks; each chunk being a resumable upload). For this last 
case, this PR also adds a test testSnapshotWithLargeSegmentFiles 
which makes it more probable that a chunked resumable upload is 
executed.
2019-09-18 09:33:05 +02:00
Armin Braun 371c355bca
Retry GCS Resumable Upload on Error 410 (#45963) (#46783)
A resumable upload session can fail on with a 410 error and should
be retried in that case. I added retrying twice using resetting of
the given `InputStream` as the retry mechanism since the same
approach is used by the AWS S3 SDK already as well and relied upon
by the S3 repository implementation.

Related GCS documentation:
https://cloud.google.com/storage/docs/json_api/v1/status-codes#410_Gone
2019-09-17 19:06:43 +02:00
Tanguy Leroux 88bed09119 Mutualize code in cloud-based repository integration tests (#46483)
This commit factors out some common code between the cloud-based
repository integration tests that were recently improved.

Relates #46376
2019-09-09 16:02:14 +02:00
Tanguy Leroux 8e3dc68454 Inject random server errors in GoogleCloudStorageBlobStoreRepositoryTests (#46376)
This commit modifies the HTTP server used in 
GoogleCloudStorageBlobStoreRepositoryTests so that it randomly 
returns server errors. The test does not inject server errors for the 
following types of request: batch request, resumable upload request.
2019-09-09 09:59:59 +02:00
Tanguy Leroux 28974b5723 Replace mocked client in GCSBlobStoreRepositoryTests by HTTP server (#46255)
This commit removes the usage of MockGoogleCloudStoragePlugin in
GoogleCloudStorageBlobStoreRepositoryTests and replaces it by a
HttpServer that emulates the Storage service. This allows the repository
tests to use the real Google's client under the hood in tests and will allow
us to test the behavior of the snapshot/restore feature for GCS repositories
by simulating random server-side internal errors.

The HTTP server used to emulate the Storage service is intentionally simple
and minimal to keep things understandable and maintainable. Testing full
client options on the server side (like authentication, chunked encoding
etc) remains the responsibility of the GoogleCloudStorageFixture.
2019-09-05 10:37:37 +02:00
Tanguy Leroux 9e14ffa8be Few clean ups in ESBlobStoreRepositoryIntegTestCase (#46068) 2019-08-28 16:29:46 +02:00
Jason Tedor 3d64605075
Remove node settings from blob store repositories (#45991)
This commit starts from the simple premise that the use of node settings
in blob store repositories is a mistake. Here we see that the node
settings are used to get default settings for store and restore throttle
rates. Yet, since there are not any node settings registered to this
effect, there can never be a default setting to fall back to there, and
so we always end up falling back to the default rate. Since this was the
only use of node settings in blob store repository, we move them. From
this, several places fall out where we were chaining settings through
only to get them to the blob store repository, so we clean these up as
well. That leaves us with the changeset in this commit.
2019-08-26 16:26:13 -04:00
Armin Braun 6aaee8aa0a
Repository Cleanup Endpoint (#43900) (#45780)
* Repository Cleanup Endpoint (#43900)

* Snapshot cleanup functionality via transport/REST endpoint.
* Added all the infrastructure for this with the HLRC and node client
* Made use of it in tests and resolved relevant TODO
* Added new `Custom` CS element that tracks the cleanup logic.
Kept it similar to the delete and in progress classes and gave it
some (for now) redundant way of handling multiple cleanups but only allow one
* Use the exact same mechanism used by deletes to have the combination
of CS entry and increment in repository state ID provide some
concurrency safety (the initial approach of just an entry in the CS
was not enough, we must increment the repository state ID to be safe
against concurrent modifications, otherwise we run the risk of "cleaning up"
blobs that just got created without noticing)
* Isolated the logic to the transport action class as much as I could.
It's not ideal, but we don't need to keep any state and do the same
for other repository operations
(like getting the detailed snapshot shard status)
2019-08-21 17:59:49 +02:00