<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--
 | Generated by Apache Maven Doxia at 2023-03-02
 | Rendered using Apache Maven Stylus Skin 1.5
-->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Apache Hadoop Amazon Web Services support – S3A Prefetching</title>
<style type="text/css" media="all">
@import url("../../css/maven-base.css");
@import url("../../css/maven-theme.css");
@import url("../../css/site.css");
</style>
<link rel="stylesheet" href="../../css/print.css" type="text/css" media="print" />
<meta name="Date-Revision-yyyymmdd" content="20230302" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body class="composite">
<div id="bodyColumn">
<div id="contentBox">
<!---
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<h1>S3A Prefetching</h1>
<p>This document explains the <code>S3PrefetchingInputStream</code> and the various components it uses.</p>
<p>This input stream implements prefetching and caching to improve its read performance. A high level overview of this feature was published in <a class="externalLink" href="https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0">Pinterest Engineering’s blog post titled “Improving efficiency and reducing runtime using S3 read optimization”</a>.</p>
<p>With prefetching, the input stream divides the remote file into blocks of a fixed size, associates buffers with these blocks and then reads data into these buffers asynchronously. It also potentially caches these blocks.</p><section><section>
<h3><a name="Basic_Concepts"></a>Basic Concepts</h3>
<ul>
<li><b>Remote File</b>: A binary blob of data stored on some storage device.</li>
<li><b>Block File</b>: A local file containing a block of the remote file.</li>
<li><b>Block</b>: A file is divided into a number of blocks. The first n-1 blocks are the same size, while the last block may be the same size or smaller.</li>
<li><b>Block based reading</b>: The granularity of a read is one block. That is, either an entire block is read and returned or none at all. Multiple blocks may be read in parallel.</li>
</ul></section><section>
<h3><a name="Configuring_the_stream"></a>Configuring the stream</h3>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th>Property </th>
<th>Meaning </th>
<th>Default </th></tr>
</thead><tbody>
<tr class="b">
<td><code>fs.s3a.prefetch.enabled</code> </td>
<td>Enable the prefetch input stream </td>
<td><code>false</code> </td></tr>
<tr class="a">
<td><code>fs.s3a.prefetch.block.size</code> </td>
<td>Size of a block </td>
<td><code>8M</code> </td></tr>
<tr class="b">
<td><code>fs.s3a.prefetch.block.count</code> </td>
<td>Number of blocks to prefetch </td>
<td><code>8</code> </td></tr>
</tbody>
</table>
<p>The default block size is 8MB, and the minimum allowed block size is 1 byte. Decreasing the block size increases the number of blocks that must be read for a file; a smaller block size may therefore negatively impact performance, as the number of prefetches required increases.</p></section><section>
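<p>As an illustration, the properties above can be set programmatically through the Hadoop <code>Configuration</code> API before the filesystem is created. This is only a sketch: the property names are those from the table above, while the bucket name, object path and tuning values are hypothetical.</p>
<div class="source">
<div class="source">
<pre>import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PrefetchConfigExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Enable the prefetching input stream (disabled by default).
    conf.setBoolean("fs.s3a.prefetch.enabled", true);
    // Illustrative tuning only: 16MB blocks, 4 blocks prefetched ahead.
    conf.set("fs.s3a.prefetch.block.size", "16M");
    conf.setInt("fs.s3a.prefetch.block.count", 4);

    // Hypothetical bucket and object, used purely for illustration.
    FileSystem fs = FileSystem.get(new URI("s3a://example-bucket/"), conf);
    try (FSDataInputStream in = fs.open(new Path("s3a://example-bucket/data/file.bin"))) {
      byte[] buffer = new byte[1024 * 1024];
      int bytesRead = in.read(buffer, 0, buffer.length);
      System.out.println("Read " + bytesRead + " bytes");
    }
  }
}
</pre></div></div>
<p>The same properties can equally be set in <code>core-site.xml</code>.</p>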
<h3><a name="Key_Components"></a>Key Components</h3>
<p><code>S3PrefetchingInputStream</code> - When prefetching is enabled, S3AFileSystem will return an instance of this class as the input stream. Depending on the remote file size, it will use either the <code>S3InMemoryInputStream</code> or the <code>S3CachingInputStream</code> as the underlying input stream.</p>
<p><code>S3InMemoryInputStream</code> - Underlying input stream used when the remote file size is less than the configured block size. It reads the entire remote file into memory.</p>
<p><code>S3CachingInputStream</code> - Underlying input stream used when the remote file size is greater than the configured block size. It uses asynchronous prefetching of blocks and caching to improve performance.</p>
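<p>The choice between the two underlying streams can be pictured with a minimal sketch such as the one below. The helper is purely illustrative and is not the actual selection code; in particular, how a file whose size exactly equals the block size is handled is not specified here.</p>
<div class="source">
<div class="source">
<pre>// Illustrative sketch only: pick the underlying stream type the way the
// documentation describes it. Not the real S3AFileSystem logic.
static String chooseStreamType(long fileSize, long blockSize) {
  if (fileSize &lt; blockSize) {
    // The whole object fits into a single in-memory buffer.
    return "S3InMemoryInputStream";
  }
  // The object spans multiple blocks: prefetch and cache them.
  return "S3CachingInputStream";
}
</pre></div></div>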
<p><code>BlockData</code> - Holds information about the blocks in a remote file, such as:</p>
<ul>
<li>Number of blocks in the remote file</li>
<li>Block size</li>
<li>State of each block (initially all blocks have state <i>NOT_READY</i>). Other states are: Queued, Ready, Cached.</li>
</ul>
<p><code>BufferData</code> - Holds the buffer and additional information about it, such as:</p>
<ul>
<li>The block number this buffer is for</li>
<li>State of the buffer (Unknown, Blank, Prefetching, Caching, Ready, Done). The initial state of a buffer is Blank.</li>
</ul>
<p><code>CachingBlockManager</code> - Implements reading data into the buffer, prefetching and caching.</p>
<p><code>BufferPool</code> - Manages a fixed-size pool of buffers. It is used by <code>CachingBlockManager</code> to acquire buffers.</p>
<p><code>S3File</code> - Implements operations to interact with S3, such as opening and closing the input stream to the remote file in S3.</p>
<p><code>S3Reader</code> - Implements reading from the stream opened by <code>S3File</code>. Reads from this input stream in chunks of 64KB, as sketched below.</p>
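<p>A simplified view of that chunked read loop is sketched below; it is not the real <code>S3Reader</code> code, only an illustration of filling a destination buffer in fixed 64KB slices.</p>
<div class="source">
<div class="source">
<pre>// Illustrative sketch: fill 'buffer' from 'in' in 64KB chunks, the read
// granularity described above for S3Reader. Error handling is omitted.
static int readFully(java.io.InputStream in, java.nio.ByteBuffer buffer)
    throws java.io.IOException {
  final int chunkSize = 64 * 1024;
  byte[] chunk = new byte[chunkSize];
  int total = 0;
  while (buffer.hasRemaining()) {
    int toRead = Math.min(chunkSize, buffer.remaining());
    int bytesRead = in.read(chunk, 0, toRead);
    if (bytesRead &lt; 0) {
      break;                       // end of the S3 stream
    }
    buffer.put(chunk, 0, bytesRead);
    total += bytesRead;
  }
  return total;
}
</pre></div></div>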
<p><code>FilePosition</code> - Provides functionality related to tracking the read position in the file. Also gives access to the current buffer in use.</p>
<p><code>SingleFilePerBlockCache</code> - Responsible for caching blocks on the local file system. Each cached block is stored on the local disk as a separate block file.</p></section><section>
<h3><a name="Operation"></a>Operation</h3><section>
<h4><a name="S3InMemoryInputStream"></a>S3InMemoryInputStream</h4>
<p>For a remote file of size 5MB and a block size of 8MB, since the file size is less than the block size, the <code>S3InMemoryInputStream</code> will be used.</p>
<p>If the caller makes the following read calls:</p>
<div class="source">
|
|
<div class="source">
|
|
<pre>in.read(buffer, 0, 3MB);
|
|
in.read(buffer, 0, 2MB);
|
|
</pre></div></div>
|
|
|
<p>When the first read is issued, there is no buffer in use yet. The <code>S3InMemoryInputStream</code> gets the data in this remote file by calling the <code>ensureCurrentBuffer()</code> method, which ensures that a buffer with data is available to be read from.</p>
<p><code>ensureCurrentBuffer()</code> then:</p>
<ul>
<li>Reads data into a buffer by calling <code>S3Reader.read(ByteBuffer buffer, long offset, int size)</code>.</li>
<li><code>S3Reader</code> uses <code>S3File</code> to open an input stream to the remote file in S3 by making a <code>getObject()</code> request with range <code>(0, filesize)</code>.</li>
<li>The <code>S3Reader</code> reads the entire remote file into the provided buffer, and once reading is complete it closes the S3 stream and frees all underlying resources.</li>
<li>Now that the entire remote file is in a buffer, this data is set in <code>FilePosition</code> so it can be accessed by the input stream.</li>
</ul>
<p>The read operation now just gets the required bytes from the buffer in <code>FilePosition</code>, as sketched below.</p>
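<p>Putting the steps above together, a heavily simplified, self-contained sketch of the in-memory path might look like the following. The class name and structure are hypothetical; only the overall flow matches the description above.</p>
<div class="source">
<div class="source">
<pre>// Illustrative sketch of the in-memory path: the whole object is read once
// into a single buffer, after which every read() is served from that buffer.
class InMemoryStreamSketch {
  private final java.io.InputStream s3Stream;  // stream opened for range (0, fileSize)
  private final int fileSize;
  private java.nio.ByteBuffer data;            // filled on the first read()

  InMemoryStreamSketch(java.io.InputStream s3Stream, int fileSize) {
    this.s3Stream = s3Stream;
    this.fileSize = fileSize;
  }

  private void ensureCurrentBuffer() throws java.io.IOException {
    if (data == null) {
      byte[] all = new byte[fileSize];
      int pos = 0;
      while (pos &lt; fileSize) {
        int n = s3Stream.read(all, pos, fileSize - pos);
        if (n &lt; 0) {
          break;
        }
        pos += n;
      }
      s3Stream.close();                        // free the S3 connection
      data = java.nio.ByteBuffer.wrap(all, 0, pos);
    }
  }

  int read(byte[] dest, int offset, int length) throws java.io.IOException {
    ensureCurrentBuffer();
    if (!data.hasRemaining()) {
      return -1;                               // end of file
    }
    int n = Math.min(length, data.remaining());
    data.get(dest, offset, n);
    return n;
  }
}
</pre></div></div>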
<p>When the second read is issued, there is already a valid buffer which can be used, so nothing else needs to be done; the required bytes are simply read from this buffer.</p></section><section>
<h4><a name="S3CachingInputStream"></a>S3CachingInputStream</h4>
<p>For a remote file with size 40MB and block size = 8MB, the <code>S3CachingInputStream</code> will be used.</p><section>
<h5><a name="Sequential_Reads"></a>Sequential Reads</h5>
<p>If the caller makes the following calls:</p>
<div class="source">
<div class="source">
<pre>in.read(buffer, 0, 5MB)
in.read(buffer, 0, 8MB)
</pre></div></div>
<p>For the first read call, there is no valid buffer yet. <code>ensureCurrentBuffer()</code> is called, and for the first <code>read()</code> the prefetch count is set to 1.</p>
<p>The current block (block 0) is read synchronously, while the block to be prefetched (block 1) is read asynchronously.</p>
<p>The <code>CachingBlockManager</code> is responsible for getting buffers from the buffer pool and reading data into them. Acquiring a buffer from the pool works as follows (see the sketch below):</p>
<ul>
<li>The buffer pool keeps a map of allocated buffers and a pool of available buffers. The size of this pool is the prefetch block count + 1; if the prefetch block count is 8, the buffer pool has a size of 9.</li>
<li>If the pool is not yet at capacity, create a new buffer and add it to the pool.</li>
<li>If it is at capacity, check whether any buffers with state = done can be released. Releasing a buffer means removing it from the allocated map and returning it to the pool of available buffers.</li>
<li>If no buffers are currently in the done state, nothing will be released, so retry the above step at a fixed interval a few times until a buffer becomes available.</li>
<li>If after multiple retries there are still no available buffers, release a buffer in the ready state. The buffer for the block furthest from the current block is released.</li>
</ul>
<p>Once a buffer has been acquired by <code>CachingBlockManager</code>, if the buffer is in a <i>READY</i> state, it is returned. This means that data was already read into this buffer asynchronously by a prefetch. If its state is <i>BLANK</i>, then data is read into it using <code>S3Reader.read(ByteBuffer buffer, long offset, int size)</code>.</p>
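<p>A condensed sketch of that acquisition loop is shown below. All names are hypothetical and the real <code>BufferPool</code> also tracks the per-buffer states listed earlier; the final “furthest block” eviction step is only indicated in a comment.</p>
<div class="source">
<div class="source">
<pre>// Condensed, illustrative sketch of buffer acquisition. Hypothetical names;
// the real BufferPool also tracks BLANK/PREFETCHING/CACHING/READY/DONE states.
final class BufferPoolSketch {
  private static final int MAX_RETRIES = 10;
  private static final long RETRY_INTERVAL_MILLIS = 10;

  private final int capacity;                  // prefetch block count + 1
  private final int bufferSize;
  private final java.util.Map&lt;Integer, java.nio.ByteBuffer&gt; allocated = new java.util.HashMap&lt;&gt;();
  private final java.util.ArrayDeque&lt;java.nio.ByteBuffer&gt; available = new java.util.ArrayDeque&lt;&gt;();

  BufferPoolSketch(int prefetchBlockCount, int bufferSize) {
    this.capacity = prefetchBlockCount + 1;
    this.bufferSize = bufferSize;
  }

  /** Called once a caller is completely done with a block's buffer. */
  synchronized void release(int blockNumber) {
    java.nio.ByteBuffer buffer = allocated.remove(blockNumber);
    if (buffer != null) {
      buffer.clear();
      available.push(buffer);
      notifyAll();                             // wake up a waiting acquire()
    }
  }

  synchronized java.nio.ByteBuffer acquire(int blockNumber) throws InterruptedException {
    for (int attempt = 0; attempt &lt; MAX_RETRIES; attempt++) {
      if (!available.isEmpty()) {              // a released (done) buffer can be reused
        java.nio.ByteBuffer buffer = available.pop();
        allocated.put(blockNumber, buffer);
        return buffer;
      }
      if (allocated.size() &lt; capacity) {       // pool not yet at capacity: create a new buffer
        java.nio.ByteBuffer buffer = java.nio.ByteBuffer.allocate(bufferSize);
        allocated.put(blockNumber, buffer);
        return buffer;
      }
      wait(RETRY_INTERVAL_MILLIS);             // nothing released yet: wait and retry
    }
    // The real pool would now evict the READY buffer for the block furthest
    // from the requested block; that bookkeeping is omitted in this sketch.
    throw new IllegalStateException("no buffer available for block " + blockNumber);
  }
}
</pre></div></div>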
<p>For the second read call, <code>in.read(buffer, 0, 8MB)</code>, since the block size is 8MB and only 5MB of block 0 has been read so far, 3MB of the required data will be read from the current block 0. Once all data has been read from this block, <code>S3CachingInputStream</code> requests the next block (block 1), which will already have been prefetched, so it can just start reading from it. While reading from block 1, it will also issue prefetch requests for the next blocks. The number of blocks to be prefetched is determined by <code>fs.s3a.prefetch.block.count</code>.</p></section><section>
<h5><a name="Random_Reads"></a>Random Reads</h5>
<p>The <code>S3CachingInputStream</code> also caches prefetched blocks. This happens when a <code>read()</code> is issued after a <code>seek()</code> outside the current block, but the current block has still not been fully read.</p>
<p>For example, consider the following calls:</p>
<div class="source">
<div class="source">
<pre>in.read(buffer, 0, 5MB)
in.seek(10MB)
in.read(buffer, 0, 4MB)
in.seek(2MB)
in.read(buffer, 0, 4MB)
</pre></div></div>
<p>For the above read sequence, after the <code>seek(10MB)</code> call is issued, block 0 has not been read completely, so the subsequent call to <code>read()</code> will cache it, on the assumption that the caller will probably want to read from it again.</p>
<p>After <code>seek(2MB)</code> is called, the position is back inside block 0. The next read can now be satisfied from the locally cached block file, which is typically orders of magnitude faster than a network-based read.</p>
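<p>A minimal sketch of how such a read is served is shown below. The helper objects (<code>localCache</code>, <code>blockManager</code>) and their methods are hypothetical stand-ins for <code>SingleFilePerBlockCache</code> and <code>CachingBlockManager</code>; only the overall decision matches the description above.</p>
<div class="source">
<div class="source">
<pre>// Illustrative sketch only: serve a read either from a locally cached block
// file or from a (possibly prefetched) buffer. Hypothetical helper objects.
int readAt(long position, byte[] dest, int offset, int length) throws java.io.IOException {
  int blockNumber = (int) (position / blockSize);
  int offsetInBlock = (int) (position % blockSize);
  if (localCache.containsBlock(blockNumber)) {
    // The block was cached to a local block file earlier: no network call needed.
    return localCache.readBlock(blockNumber, offsetInBlock, dest, offset, length);
  }
  // Not cached: ask the block manager for the block, which either returns an
  // already prefetched buffer or reads it from S3, then copy out the bytes.
  java.nio.ByteBuffer buffer = blockManager.getBlock(blockNumber);
  buffer.position(offsetInBlock);
  int n = Math.min(length, buffer.remaining());
  buffer.get(dest, offset, n);
  return n;
}
</pre></div></div>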
<p>NB: <code>seek()</code> is implemented lazily; it only keeps track of the new position and does not otherwise affect the internal state of the stream. Only when a <code>read()</code> is issued does it call the <code>ensureCurrentBuffer()</code> method and fetch a new block if required.</p></section></section></section></section>
</div>
</div>
</body>
</html>