hadoop/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement....

896 lines
49 KiB
HTML
Raw Normal View History

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--
| Generated by Apache Maven Doxia at 2023-02-21
| Rendered using Apache Maven Stylus Skin 1.5
-->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Apache Hadoop 3.4.0-SNAPSHOT &#x2013; Centralized Cache Management in HDFS</title>
<style type="text/css" media="all">
@import url("./css/maven-base.css");
@import url("./css/maven-theme.css");
@import url("./css/site.css");
</style>
<link rel="stylesheet" href="./css/print.css" type="text/css" media="print" />
<meta name="Date-Revision-yyyymmdd" content="20230221" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body class="composite">
<div id="banner">
<a href="http://hadoop.apache.org/" id="bannerLeft">
<img src="http://hadoop.apache.org/images/hadoop-logo.jpg" alt="" />
</a>
<a href="http://www.apache.org/" id="bannerRight">
<img src="http://www.apache.org/images/asf_logo_wide.png" alt="" />
</a>
<div class="clear">
<hr/>
</div>
</div>
<div id="breadcrumbs">
<div class="xright"> <a href="http://wiki.apache.org/hadoop" class="externalLink">Wiki</a>
|
<a href="https://gitbox.apache.org/repos/asf/hadoop.git" class="externalLink">git</a>
|
<a href="http://hadoop.apache.org/" class="externalLink">Apache Hadoop</a>
&nbsp;| Last Published: 2023-02-21
&nbsp;| Version: 3.4.0-SNAPSHOT
</div>
<div class="clear">
<hr/>
</div>
</div>
<div id="leftColumn">
<div id="navcolumn">
<h5>General</h5>
<ul>
<li class="none">
<a href="../../index.html">Overview</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/SingleCluster.html">Single Node Setup</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/ClusterSetup.html">Cluster Setup</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/CommandsManual.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/FileSystemShell.html">FileSystem Shell</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/Compatibility.html">Compatibility Specification</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/DownstreamDev.html">Downstream Developer's Guide</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/AdminCompatibilityGuide.html">Admin Compatibility Guide</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/InterfaceClassification.html">Interface Classification</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/filesystem/index.html">FileSystem Specification</a>
</li>
</ul>
<h5>Common</h5>
<ul>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/CLIMiniCluster.html">CLI Mini Cluster</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/FairCallQueue.html">Fair Call Queue</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/NativeLibraries.html">Native Libraries</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/Superusers.html">Proxy User</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/RackAwareness.html">Rack Awareness</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/SecureMode.html">Secure Mode</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/ServiceLevelAuth.html">Service Level Authorization</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/HttpAuthentication.html">HTTP Authentication</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/CredentialProviderAPI.html">Credential Provider API</a>
</li>
<li class="none">
<a href="../../hadoop-kms/index.html">Hadoop KMS</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/Tracing.html">Tracing</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/UnixShellGuide.html">Unix Shell Guide</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/registry/index.html">Registry</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/AsyncProfilerServlet.html">Async Profiler</a>
</li>
</ul>
<h5>HDFS</h5>
<ul>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsDesign.html">Architecture</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html">User Guide</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HDFSCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html">NameNode HA With QJM</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html">NameNode HA With NFS</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ObserverNameNode.html">Observer NameNode</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/Federation.html">Federation</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ViewFs.html">ViewFs</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ViewFsOverloadScheme.html">ViewFsOverloadScheme</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html">Snapshots</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsEditsViewer.html">Edits Viewer</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html">Image Viewer</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html">Permissions and HDFS</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html">Quotas and HDFS</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/LibHdfs.html">libhdfs (C API)</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/WebHDFS.html">WebHDFS (REST API)</a>
</li>
<li class="none">
<a href="../../hadoop-hdfs-httpfs/index.html">HttpFS</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html">Short Circuit Local Reads</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html">Centralized Cache Management</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html">NFS Gateway</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html">Rolling Upgrade</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ExtendedAttributes.html">Extended Attributes</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html">Transparent Encryption</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html">Multihoming</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html">Storage Policies</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/MemoryStorage.html">Memory Storage Support</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/SLGUserGuide.html">Synthetic Load Generator</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html">Erasure Coding</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html">Disk Balancer</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsUpgradeDomain.html">Upgrade Domain</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsDataNodeAdminGuide.html">DataNode Admin</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs-rbf/HDFSRouterFederation.html">Router Federation</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/HdfsProvidedStorage.html">Provided Storage</a>
</li>
</ul>
<h5>MapReduce</h5>
<ul>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html">Tutorial</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html">Compatibility with 1.x</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html">Encrypted Shuffle</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html">Pluggable Shuffle/Sort</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html">Distributed Cache Deploy</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/SharedCacheSupport.html">Support for YARN Shared Cache</a>
</li>
</ul>
<h5>MapReduce REST APIs</h5>
<ul>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredAppMasterRest.html">MR Application Master</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html">MR History Server</a>
</li>
</ul>
<h5>YARN</h5>
<ul>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/YARN.html">Architecture</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/YarnCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html">Capacity Scheduler</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/FairScheduler.html">Fair Scheduler</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html">ResourceManager Restart</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html">ResourceManager HA</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/ResourceModel.html">Resource Model</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/NodeLabel.html">Node Labels</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/NodeAttributes.html">Node Attributes</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html">Web Application Proxy</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/TimelineServer.html">Timeline Server</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html">Timeline Service V.2</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html">Writing YARN Applications</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/YarnApplicationSecurity.html">YARN Application Security</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/NodeManager.html">NodeManager</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/DockerContainers.html">Running Applications in Docker Containers</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/RuncContainers.html">Running Applications in runC Containers</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html">Using CGroups</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/SecureContainer.html">Secure Containers</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/ReservationSystem.html">Reservation System</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/GracefulDecommission.html">Graceful Decommission</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/OpportunisticContainers.html">Opportunistic Containers</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/Federation.html">YARN Federation</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/SharedCache.html">Shared Cache</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/UsingGpus.html">Using GPU</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/UsingFPGA.html">Using FPGA</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/PlacementConstraints.html">Placement Constraints</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/YarnUI2.html">YARN UI2</a>
</li>
</ul>
<h5>YARN REST APIs</h5>
<ul>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html">Introduction</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html">Resource Manager</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/NodeManagerRest.html">Node Manager</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/TimelineServer.html#Timeline_Server_REST_API_v1">Timeline Server</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html#Timeline_Service_v.2_REST_API">Timeline Service V.2</a>
</li>
</ul>
<h5>YARN Service</h5>
<ul>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/Overview.html">Overview</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/QuickStart.html">QuickStart</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/Concepts.html">Concepts</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/YarnServiceAPI.html">Yarn Service API</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/ServiceDiscovery.html">Service Discovery</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-site/yarn-service/SystemServices.html">System Services</a>
</li>
</ul>
<h5>Hadoop Compatible File Systems</h5>
<ul>
<li class="none">
<a href="../../hadoop-aliyun/tools/hadoop-aliyun/index.html">Aliyun OSS</a>
</li>
<li class="none">
<a href="../../hadoop-aws/tools/hadoop-aws/index.html">Amazon S3</a>
</li>
<li class="none">
<a href="../../hadoop-azure/index.html">Azure Blob Storage</a>
</li>
<li class="none">
<a href="../../hadoop-azure-datalake/index.html">Azure Data Lake Storage</a>
</li>
<li class="none">
<a href="../../hadoop-cos/cloud-storage/index.html">Tencent COS</a>
</li>
<li class="none">
<a href="../../hadoop-huaweicloud/cloud-storage/index.html">Huaweicloud OBS</a>
</li>
</ul>
<h5>Auth</h5>
<ul>
<li class="none">
<a href="../../hadoop-auth/index.html">Overview</a>
</li>
<li class="none">
<a href="../../hadoop-auth/Examples.html">Examples</a>
</li>
<li class="none">
<a href="../../hadoop-auth/Configuration.html">Configuration</a>
</li>
<li class="none">
<a href="../../hadoop-auth/BuildingIt.html">Building</a>
</li>
</ul>
<h5>Tools</h5>
<ul>
<li class="none">
<a href="../../hadoop-streaming/HadoopStreaming.html">Hadoop Streaming</a>
</li>
<li class="none">
<a href="../../hadoop-archives/HadoopArchives.html">Hadoop Archives</a>
</li>
<li class="none">
<a href="../../hadoop-archive-logs/HadoopArchiveLogs.html">Hadoop Archive Logs</a>
</li>
<li class="none">
<a href="../../hadoop-distcp/DistCp.html">DistCp</a>
</li>
<li class="none">
<a href="../../hadoop-federation-balance/HDFSFederationBalance.html">HDFS Federation Balance</a>
</li>
<li class="none">
<a href="../../hadoop-gridmix/GridMix.html">GridMix</a>
</li>
<li class="none">
<a href="../../hadoop-rumen/Rumen.html">Rumen</a>
</li>
<li class="none">
<a href="../../hadoop-resourceestimator/ResourceEstimator.html">Resource Estimator Service</a>
</li>
<li class="none">
<a href="../../hadoop-sls/SchedulerLoadSimulator.html">Scheduler Load Simulator</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/Benchmarking.html">Hadoop Benchmarking</a>
</li>
<li class="none">
<a href="../../hadoop-dynamometer/Dynamometer.html">Dynamometer</a>
</li>
</ul>
<h5>Reference</h5>
<ul>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/release/">Changelog and Release Notes</a>
</li>
<li class="none">
<a href="../../api/index.html">Java API docs</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/UnixShellAPI.html">Unix Shell API</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/Metrics.html">Metrics</a>
</li>
</ul>
<h5>Configuration</h5>
<ul>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/core-default.xml">core-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs/hdfs-default.xml">hdfs-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-hdfs-rbf/hdfs-rbf-default.xml">hdfs-rbf-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml">mapred-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-yarn/hadoop-yarn-common/yarn-default.xml">yarn-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-kms/kms-default.html">kms-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-hdfs-httpfs/httpfs-default.html">httpfs-default.xml</a>
</li>
<li class="none">
<a href="../../hadoop-project-dist/hadoop-common/DeprecatedProperties.html">Deprecated Properties</a>
</li>
</ul>
<a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy">
<img alt="Built by Maven" src="./images/logos/maven-feather.png"/>
</a>
</div>
</div>
<div id="bodyColumn">
<div id="contentBox">
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<h1>Centralized Cache Management in HDFS</h1>
<ul>
<li><a href="#Overview">Overview</a></li>
<li><a href="#Use_Cases">Use Cases</a></li>
<li><a href="#Architecture">Architecture</a></li>
<li><a href="#Concepts">Concepts</a>
<ul>
<li><a href="#Cache_directive">Cache directive</a></li>
<li><a href="#Cache_pool">Cache pool</a></li></ul></li>
<li><a href="#cacheadmin_command-line_interface">cacheadmin command-line interface</a>
<ul>
<li><a href="#Cache_directive_commands">Cache directive commands</a>
<ul>
<li><a href="#addDirective">addDirective</a></li>
<li><a href="#removeDirective">removeDirective</a></li>
<li><a href="#removeDirectives">removeDirectives</a></li>
<li><a href="#listDirectives">listDirectives</a></li></ul></li>
<li><a href="#Cache_pool_commands">Cache pool commands</a>
<ul>
<li><a href="#addPool">addPool</a></li>
<li><a href="#modifyPool">modifyPool</a></li>
<li><a href="#removePool">removePool</a></li>
<li><a href="#listPools">listPools</a></li>
<li><a href="#help">help</a></li></ul></li></ul></li>
<li><a href="#Configuration">Configuration</a>
<ul>
<li><a href="#Native_Libraries">Native Libraries</a></li>
<li><a href="#Configuration_Properties">Configuration Properties</a>
<ul>
<li><a href="#Required">Required</a></li>
<li><a href="#Optional">Optional</a></li></ul></li>
<li><a href="#OS_Limits">OS Limits</a></li></ul></li></ul>
<section>
<h2><a name="Overview"></a>Overview</h2>
<p><i>Centralized cache management</i> in HDFS is an explicit caching mechanism that allows users to specify <i>paths</i> to be cached by HDFS. The NameNode will communicate with DataNodes that have the desired blocks on disk, and instruct them to cache the blocks in off-heap caches.</p>
<p>Centralized cache management in HDFS has many significant advantages.</p>
<ol style="list-style-type: decimal">
<li>
<p>Explicit pinning prevents frequently used data from being evicted from memory. This is particularly important when the size of the working set exceeds the size of main memory, which is common for many HDFS workloads.</p>
</li>
<li>
<p>Because DataNode caches are managed by the NameNode, applications can query the set of cached block locations when making task placement decisions. Co-locating a task with a cached block replica improves read performance.</p>
</li>
<li>
<p>When block has been cached by a DataNode, clients can use a new , more-efficient, zero-copy read API. Since checksum verification of cached data is done once by the DataNode, clients can incur essentially zero overhead when using this new API.</p>
</li>
<li>
<p>Centralized caching can improve overall cluster memory utilization. When relying on the OS buffer cache at each DataNode, repeated reads of a block will result in all <i>n</i> replicas of the block being pulled into buffer cache. With centralized cache management, a user can explicitly pin only <i>m</i> of the <i>n</i> replicas, saving <i>n-m</i> memory.</p>
</li>
<li>
<p>HDFS supports non-volatile storage class memory (SCM, also known as persistent memory) cache in Linux platform. User can enable either DRAM cache or SCM cache for a DataNode. DRAM cache and SCM cache can coexist among DataNodes. In addition, cache persistence is supported by SCM cache. The status of cache persisted in SCM will be recovered during the start of DataNode if <code>dfs.datanode.pmem.cache.recovery</code> is set to true. Otherwise, previously persisted cache will be dropped and data need to be re-cached.</p>
</li>
</ol></section><section>
<h2><a name="Use_Cases"></a>Use Cases</h2>
<p>Centralized cache management is useful for files that accessed repeatedly. For example, a small <i>fact table</i> in Hive which is often used for joins is a good candidate for caching. On the other hand, caching the input of a <i>one year reporting query</i> is probably less useful, since the historical data might only be read once.</p>
<p>Centralized cache management is also useful for mixed workloads with performance SLAs. Caching the working set of a high-priority workload insures that it does not contend for disk I/O with a low-priority workload.</p></section><section>
<h2><a name="Architecture"></a>Architecture</h2>
<p><img src="images/caching.png" alt="Caching Architecture" /></p>
<p>In this architecture, the NameNode is responsible for coordinating all the DataNode off-heap caches in the cluster. The NameNode periodically receives a <i>cache report</i> from each DataNode which describes all the blocks cached on a given DN. The NameNode manages DataNode caches by piggybacking cache and uncache commands on the DataNode heartbeat.</p>
<p>The NameNode queries its set of <i>cache directives</i> to determine which paths should be cached. Cache directives are persistently stored in the fsimage and edit log, and can be added, removed, and modified via Java and command-line APIs. The NameNode also stores a set of <i>cache pools</i>, which are administrative entities used to group cache directives together for resource management and enforcing permissions.</p>
<p>The NameNode periodically rescans the namespace and active cache directives to determine which blocks need to be cached or uncached and assign caching work to DataNodes. Rescans can also be triggered by user actions like adding or removing a cache directive or removing a cache pool.</p>
<p>We do not currently cache blocks which are under construction, corrupt, or otherwise incomplete. If a cache directive covers a symlink, the symlink target is not cached.</p>
<p>Caching is currently done on the file or directory-level. Block and sub-block caching is an item of future work.</p></section><section>
<h2><a name="Concepts"></a>Concepts</h2><section>
<h3><a name="Cache_directive"></a>Cache directive</h3>
<p>A <i>cache directive</i> defines a path that should be cached. Paths can be either directories or files. Directories are cached non-recursively, meaning only files in the first-level listing of the directory.</p>
<p>Directives also specify additional parameters, such as the cache replication factor and expiration time. The replication factor specifies the number of block replicas to cache. If multiple cache directives refer to the same file, the maximum cache replication factor is applied.</p>
<p>The expiration time is specified on the command line as a <i>time-to-live (TTL)</i>, a relative expiration time in the future. After a cache directive expires, it is no longer considered by the NameNode when making caching decisions.</p></section><section>
<h3><a name="Cache_pool"></a>Cache pool</h3>
<p>A <i>cache pool</i> is an administrative entity used to manage groups of cache directives. Cache pools have UNIX-like <i>permissions</i>, which restrict which users and groups have access to the pool. Write permissions allow users to add and remove cache directives to the pool. Read permissions allow users to list the cache directives in a pool, as well as additional metadata. Execute permissions are unused.</p>
<p>Cache pools are also used for resource management. Pools can enforce a maximum <i>limit</i>, which restricts the number of bytes that can be cached in aggregate by directives in the pool. Normally, the sum of the pool limits will approximately equal the amount of aggregate memory reserved for HDFS caching on the cluster. Cache pools also track a number of statistics to help cluster users determine what is and should be cached.</p>
<p>Pools also can enforce a maximum time-to-live. This restricts the maximum expiration time of directives being added to the pool.</p></section></section><section>
<h2><a name="cacheadmin_command-line_interface"></a><code>cacheadmin</code> command-line interface</h2>
<p>On the command-line, administrators and users can interact with cache pools and directives via the <code>hdfs cacheadmin</code> subcommand.</p>
<p>Cache directives are identified by a unique, non-repeating 64-bit integer ID. IDs will not be reused even if a cache directive is later removed.</p>
<p>Cache pools are identified by a unique string name.</p><section>
<h3><a name="Cache_directive_commands"></a>Cache directive commands</h3><section>
<h4><a name="addDirective"></a>addDirective</h4>
<p>Usage: <code>hdfs cacheadmin -addDirective -path &lt;path&gt; -pool &lt;pool-name&gt; [-force] [-replication &lt;replication&gt;] [-ttl &lt;time-to-live&gt;]</code></p>
<p>Add a new cache directive.</p>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th align="left"> </th>
<th align="left"> </th></tr>
</thead><tbody>
<tr class="b">
<td align="left"> &lt;path&gt; </td>
<td align="left"> A path to cache. The path can be a directory or a file. </td></tr>
<tr class="a">
<td align="left"> &lt;pool-name&gt; </td>
<td align="left"> The pool to which the directive will be added. You must have write permission on the cache pool in order to add new directives. </td></tr>
<tr class="b">
<td align="left"> -force </td>
<td align="left"> Skips checking of cache pool resource limits. </td></tr>
<tr class="a">
<td align="left"> &lt;replication&gt; </td>
<td align="left"> The cache replication factor to use. Defaults to 1. </td></tr>
<tr class="b">
<td align="left"> &lt;time-to-live&gt; </td>
<td align="left"> How long the directive is valid. Can be specified in minutes, hours, and days, e.g. 30m, 4h, 2d. Valid units are [smhd]. &#x201c;never&#x201d; indicates a directive that never expires. If unspecified, the directive never expires. </td></tr>
</tbody>
</table></section><section>
<h4><a name="removeDirective"></a>removeDirective</h4>
<p>Usage: <code>hdfs cacheadmin -removeDirective &lt;id&gt;</code></p>
<p>Remove a cache directive.</p>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th align="left"> </th>
<th align="left"> </th></tr>
</thead><tbody>
<tr class="b">
<td align="left"> &lt;id&gt; </td>
<td align="left"> The id of the cache directive to remove. You must have write permission on the pool of the directive in order to remove it. To see a list of cachedirective IDs, use the -listDirectives command. </td></tr>
</tbody>
</table></section><section>
<h4><a name="removeDirectives"></a>removeDirectives</h4>
<p>Usage: <code>hdfs cacheadmin -removeDirectives &lt;path&gt;</code></p>
<p>Remove every cache directive with the specified path.</p>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th align="left"> </th>
<th align="left"> </th></tr>
</thead><tbody>
<tr class="b">
<td align="left"> &lt;path&gt; </td>
<td align="left"> The path of the cache directives to remove. You must have write permission on the pool of the directive in order to remove it. To see a list of cache directives, use the -listDirectives command. </td></tr>
</tbody>
</table></section><section>
<h4><a name="listDirectives"></a>listDirectives</h4>
<p>Usage: <code>hdfs cacheadmin -listDirectives [-stats] [-path &lt;path&gt;] [-pool &lt;pool&gt;]</code></p>
<p>List cache directives.</p>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th align="left"> </th>
<th align="left"> </th></tr>
</thead><tbody>
<tr class="b">
<td align="left"> &lt;path&gt; </td>
<td align="left"> List only cache directives with this path. Note that if there is a cache directive for <i>path</i> in a cache pool that we don&#x2019;t have read access for, it will not be listed. </td></tr>
<tr class="a">
<td align="left"> &lt;pool&gt; </td>
<td align="left"> List only path cache directives in that pool. </td></tr>
<tr class="b">
<td align="left"> -stats </td>
<td align="left"> List path-based cache directive statistics. </td></tr>
</tbody>
</table></section></section><section>
<h3><a name="Cache_pool_commands"></a>Cache pool commands</h3><section>
<h4><a name="addPool"></a>addPool</h4>
<p>Usage: <code>hdfs cacheadmin -addPool &lt;name&gt; [-owner &lt;owner&gt;] [-group &lt;group&gt;] [-mode &lt;mode&gt;] [-limit &lt;limit&gt;] [-maxTtl &lt;maxTtl&gt;]</code></p>
<p>Add a new cache pool.</p>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th align="left"> </th>
<th align="left"> </th></tr>
</thead><tbody>
<tr class="b">
<td align="left"> &lt;name&gt; </td>
<td align="left"> Name of the new pool. </td></tr>
<tr class="a">
<td align="left"> &lt;owner&gt; </td>
<td align="left"> Username of the owner of the pool. Defaults to the current user. </td></tr>
<tr class="b">
<td align="left"> &lt;group&gt; </td>
<td align="left"> Group of the pool. Defaults to the primary group name of the current user. </td></tr>
<tr class="a">
<td align="left"> &lt;mode&gt; </td>
<td align="left"> UNIX-style permissions for the pool. Permissions are specified in octal, e.g. 0755. By default, this is set to 0755. </td></tr>
<tr class="b">
<td align="left"> &lt;limit&gt; </td>
<td align="left"> The maximum number of bytes that can be cached by directives in this pool, in aggregate. By default, no limit is set. </td></tr>
<tr class="a">
<td align="left"> &lt;maxTtl&gt; </td>
<td align="left"> The maximum allowed time-to-live for directives being added to the pool. This can be specified in seconds, minutes, hours, and days, e.g. 120s, 30m, 4h, 2d. Valid units are [smhd]. By default, no maximum is set. A value of &#xa0;&quot;never&#xa0;&quot; specifies that there is no limit. </td></tr>
</tbody>
</table></section><section>
<h4><a name="modifyPool"></a>modifyPool</h4>
<p>Usage: <code>hdfs cacheadmin -modifyPool &lt;name&gt; [-owner &lt;owner&gt;] [-group &lt;group&gt;] [-mode &lt;mode&gt;] [-limit &lt;limit&gt;] [-maxTtl &lt;maxTtl&gt;]</code></p>
<p>Modifies the metadata of an existing cache pool.</p>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th align="left"> </th>
<th align="left"> </th></tr>
</thead><tbody>
<tr class="b">
<td align="left"> &lt;name&gt; </td>
<td align="left"> Name of the pool to modify. </td></tr>
<tr class="a">
<td align="left"> &lt;owner&gt; </td>
<td align="left"> Username of the owner of the pool. </td></tr>
<tr class="b">
<td align="left"> &lt;group&gt; </td>
<td align="left"> Groupname of the group of the pool. </td></tr>
<tr class="a">
<td align="left"> &lt;mode&gt; </td>
<td align="left"> Unix-style permissions of the pool in octal. </td></tr>
<tr class="b">
<td align="left"> &lt;limit&gt; </td>
<td align="left"> Maximum number of bytes that can be cached by this pool. </td></tr>
<tr class="a">
<td align="left"> &lt;maxTtl&gt; </td>
<td align="left"> The maximum allowed time-to-live for directives being added to the pool. </td></tr>
</tbody>
</table></section><section>
<h4><a name="removePool"></a>removePool</h4>
<p>Usage: <code>hdfs cacheadmin -removePool &lt;name&gt;</code></p>
<p>Remove a cache pool. This also uncaches paths associated with the pool.</p>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th align="left"> </th>
<th align="left"> </th></tr>
</thead><tbody>
<tr class="b">
<td align="left"> &lt;name&gt; </td>
<td align="left"> Name of the cache pool to remove. </td></tr>
</tbody>
</table></section><section>
<h4><a name="listPools"></a>listPools</h4>
<p>Usage: <code>hdfs cacheadmin -listPools [-stats] [&lt;name&gt;]</code></p>
<p>Display information about one or more cache pools, e.g. name, owner, group, permissions, etc.</p>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th align="left"> </th>
<th align="left"> </th></tr>
</thead><tbody>
<tr class="b">
<td align="left"> -stats </td>
<td align="left"> Display additional cache pool statistics. </td></tr>
<tr class="a">
<td align="left"> &lt;name&gt; </td>
<td align="left"> If specified, list only the named cache pool. </td></tr>
</tbody>
</table></section><section>
<h4><a name="help"></a>help</h4>
<p>Usage: <code>hdfs cacheadmin -help &lt;command-name&gt;</code></p>
<p>Get detailed help about a command.</p>
<table border="0" class="bodyTable">
<thead>
<tr class="a">
<th align="left"> </th>
<th align="left"> </th></tr>
</thead><tbody>
<tr class="b">
<td align="left"> &lt;command-name&gt; </td>
<td align="left"> The command for which to get detailed help. If no command is specified, print detailed help for all commands. </td></tr>
</tbody>
</table></section></section></section><section>
<h2><a name="Configuration"></a>Configuration</h2><section>
<h3><a name="Native_Libraries"></a>Native Libraries</h3>
<p>In order to lock block files into memory, the DataNode relies on native JNI code found in <code>libhadoop.so</code> or <code>hadoop.dll</code> on Windows. Be sure to <a href="../hadoop-common/NativeLibraries.html">enable JNI</a> if you are using HDFS centralized cache management.</p>
<p>Currently, there are two implementations for persistent memory cache. The default one is pure Java based implementation and the other is native implementation which leverages PMDK library to improve the performance of cache write and cache read.</p>
<p>To enable PMDK based implementation, please follow the below steps.</p>
<ol style="list-style-type: decimal">
<li>
<p>Install PMDK library. Please refer to the official site <a class="externalLink" href="http://pmem.io/">http://pmem.io/</a> for detailed information.</p>
</li>
<li>
<p>Build Hadoop with PMDK support. Please refer to &#x201c;PMDK library build options&#x201d; section in <code>BUILDING.txt</code> in the source code.</p>
</li>
</ol>
<p>To verify that PMDK is correctly detected by Hadoop, run the <code>hadoop checknative</code> command.</p></section><section>
<h3><a name="Configuration_Properties"></a>Configuration Properties</h3><section>
<h4><a name="Required"></a>Required</h4>
<p>Be sure to configure one of the following properties for DRAM cache or persistent memory cache. Please note that DRAM cache and persistent cache cannot coexist on a DataNode.</p>
<ul>
<li>
<p>dfs.datanode.max.locked.memory</p>
<p>This determines the maximum amount of memory a DataNode will use for caching. On Unix-like systems, the &#x201c;locked-in-memory size&#x201d; ulimit (<code>ulimit -l</code>) of the DataNode user also needs to be increased to match this parameter (see below section on <a href="#OS_Limits">OS Limits</a>). When setting this value, please remember that you will need space in memory for other things as well, such as the DataNode and application JVM heaps and the operating system page cache.</p>
<p>This setting is shared with the <a href="./MemoryStorage.html">Lazy Persist Writes feature</a>. The Data Node will ensure that the combined memory used by Lazy Persist Writes and Centralized Cache Management does not exceed the amount configured in <code>dfs.datanode.max.locked.memory</code>.</p>
</li>
<li>
<p>dfs.datanode.pmem.cache.dirs</p>
<p>This property specifies the cache volume of persistent memory. For multiple volumes, they should be separated by &#x201c;,&#x201d;, e.g. &#x201c;/mnt/pmem0, /mnt/pmem1&#x201d;. The default value is empty. If this property is configured, the volume capacity will be detected. And there is no need to configure <code>dfs.datanode.max.locked.memory</code>.</p>
</li>
</ul></section><section>
<h4><a name="Optional"></a>Optional</h4>
<p>The following properties are not required, but may be specified for tuning:</p>
<ul>
<li>
<p>dfs.namenode.path.based.cache.refresh.interval.ms</p>
<p>The NameNode will use this as the amount of milliseconds between subsequent path cache rescans. This calculates the blocks to cache and each DataNode containing a replica of the block that should cache it.</p>
<p>By default, this parameter is set to 30000, which is thirty seconds.</p>
</li>
<li>
<p>dfs.datanode.fsdatasetcache.max.threads.per.volume</p>
<p>The DataNode will use this as the maximum number of threads per volume to use for caching new data.</p>
<p>By default, this parameter is set to 4.</p>
</li>
<li>
<p>dfs.cachereport.intervalMsec</p>
<p>The DataNode will use this as the amount of milliseconds between sending a full report of its cache state to the NameNode.</p>
<p>By default, this parameter is set to 10000, which is 10 seconds.</p>
</li>
<li>
<p>dfs.namenode.path.based.cache.block.map.allocation.percent</p>
<p>The percentage of the Java heap which we will allocate to the cached blocks map. The cached blocks map is a hash map which uses chained hashing. Smaller maps may be accessed more slowly if the number of cached blocks is large; larger maps will consume more memory. The default is 0.25 percent.</p>
</li>
<li>
<p>dfs.namenode.caching.enabled</p>
<p>This parameter can be used to enable/disable the centralized caching in NameNode. When centralized caching is disabled, NameNode will not process cache reports or store information about block cache locations on the cluster. Note that NameNode will continute to store the path based cache locations in the file-system metadata, even though it will not act on this information until the caching is enabled. The default value for this parameter is true (i.e. centralized caching is enabled). In the current implementation, centralized caching introduces additional write lock overhead (see CacheReplicationMonitor#rescan) even if no path to cache is specified, so we recommend disabling this feature when not in use. We will disable centralized caching by default in later versions.</p>
</li>
<li>
<p>dfs.datanode.pmem.cache.recovery</p>
<p>This parameter is used to determine whether to recover the status for previous cache on persistent memory during the start of DataNode. If it is enabled, DataNode will recover the status for previously cached data on persistent memory. Thus, re-caching is avoided. If this property is not enabled, DataNode will drop cache, if any, on persistent memory. This property can only work when persistent memory cache is enabled, i.e., <code>dfs.datanode.pmem.cache.dirs</code> is configured.</p>
</li>
</ul></section></section><section>
<h3><a name="OS_Limits"></a>OS Limits</h3>
<p>If you get the error &#x201c;Cannot start datanode because the configured max locked memory size&#x2026; is more than the datanode&#x2019;s available RLIMIT_MEMLOCK ulimit,&#x201d; that means that the operating system is imposing a lower limit on the amount of memory that you can lock than what you have configured. To fix this, you must adjust the ulimit -l value that the DataNode runs with. Usually, this value is configured in <code>/etc/security/limits.conf</code>. However, it will vary depending on what operating system and distribution you are using.</p>
<p>You will know that you have correctly configured this value when you can run <code>ulimit -l</code> from the shell and get back either a higher value than what you have configured with <code>dfs.datanode.max.locked.memory</code>, or the string &#x201c;unlimited,&#x201d; indicating that there is no limit. Note that it&#x2019;s typical for <code>ulimit -l</code> to output the memory lock limit in KB, but dfs.datanode.max.locked.memory must be specified in bytes.</p>
<p>This information does not apply to deployments on Windows. Windows has no direct equivalent of <code>ulimit -l</code>.</p></section></section>
</div>
</div>
<div class="clear">
<hr/>
</div>
<div id="footer">
<div class="xright">
&#169; 2008-2023
Apache Software Foundation
- <a href="http://maven.apache.org/privacy-policy.html">Privacy Policy</a>.
Apache Maven, Maven, Apache, the Apache feather logo, and the Apache Maven project logos are trademarks of The Apache Software Foundation.
</div>
<div class="clear">
<hr/>
</div>
</div>
</body>
</html>