<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--
| Generated by Apache Maven Doxia at 2023-05-29
| Rendered using Apache Maven Stylus Skin 1.5
-->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Apache Hadoop Gridmix &#x2013; Gridmix</title>
<style type="text/css" media="all">
@import url("./css/maven-base.css");
@import url("./css/maven-theme.css");
@import url("./css/site.css");
</style>
<link rel="stylesheet" href="./css/print.css" type="text/css" media="print" />
<meta name="Date-Revision-yyyymmdd" content="20230529" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body class="composite">
<div id="banner">
<a href="http://hadoop.apache.org/" id="bannerLeft">
<img src="http://hadoop.apache.org/images/hadoop-logo.jpg" alt="" />
</a>
<a href="http://www.apache.org/" id="bannerRight">
<img src="http://www.apache.org/images/asf_logo_wide.png" alt="" />
</a>
<div class="clear">
<hr/>
</div>
</div>
<div id="breadcrumbs">
<div class="xright"> <a href="http://wiki.apache.org/hadoop" class="externalLink">Wiki</a>
|
<a href="https://gitbox.apache.org/repos/asf/hadoop.git" class="externalLink">git</a>
&nbsp;| Last Published: 2023-05-29
&nbsp;| Version: 3.4.0-SNAPSHOT
</div>
<div class="clear">
<hr/>
</div>
</div>
<div id="leftColumn">
<div id="navcolumn">
<h5>General</h5>
<ul>
<li class="none">
<a href="../index.html">Overview</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/SingleCluster.html">Single Node Setup</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/ClusterSetup.html">Cluster Setup</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/CommandsManual.html">Commands Reference</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/FileSystemShell.html">FileSystem Shell</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/Compatibility.html">Compatibility Specification</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/DownstreamDev.html">Downstream Developer's Guide</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/AdminCompatibilityGuide.html">Admin Compatibility Guide</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/InterfaceClassification.html">Interface Classification</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/filesystem/index.html">FileSystem Specification</a>
</li>
</ul>
<h5>Common</h5>
<ul>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/CLIMiniCluster.html">CLI Mini Cluster</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/FairCallQueue.html">Fair Call Queue</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/NativeLibraries.html">Native Libraries</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/Superusers.html">Proxy User</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/RackAwareness.html">Rack Awareness</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/SecureMode.html">Secure Mode</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/ServiceLevelAuth.html">Service Level Authorization</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/HttpAuthentication.html">HTTP Authentication</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/CredentialProviderAPI.html">Credential Provider API</a>
</li>
<li class="none">
<a href="../hadoop-kms/index.html">Hadoop KMS</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/Tracing.html">Tracing</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/UnixShellGuide.html">Unix Shell Guide</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/registry/index.html">Registry</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/AsyncProfilerServlet.html">Async Profiler</a>
</li>
</ul>
<h5>HDFS</h5>
<ul>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsDesign.html">Architecture</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html">User Guide</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HDFSCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html">NameNode HA With QJM</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html">NameNode HA With NFS</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/ObserverNameNode.html">Observer NameNode</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/Federation.html">Federation</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/ViewFs.html">ViewFs</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/ViewFsOverloadScheme.html">ViewFsOverloadScheme</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html">Snapshots</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsEditsViewer.html">Edits Viewer</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsImageViewer.html">Image Viewer</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html">Permissions and HDFS</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsQuotaAdminGuide.html">Quotas and HDFS</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/LibHdfs.html">libhdfs (C API)</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/WebHDFS.html">WebHDFS (REST API)</a>
</li>
<li class="none">
<a href="../hadoop-hdfs-httpfs/index.html">HttpFS</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/ShortCircuitLocalReads.html">Short Circuit Local Reads</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html">Centralized Cache Management</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html">NFS Gateway</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html">Rolling Upgrade</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/ExtendedAttributes.html">Extended Attributes</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html">Transparent Encryption</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsMultihoming.html">Multihoming</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/ArchivalStorage.html">Storage Policies</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/MemoryStorage.html">Memory Storage Support</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/SLGUserGuide.html">Synthetic Load Generator</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html">Erasure Coding</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HDFSDiskbalancer.html">Disk Balancer</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsUpgradeDomain.html">Upgrade Domain</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsDataNodeAdminGuide.html">DataNode Admin</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs-rbf/HDFSRouterFederation.html">Router Federation</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/HdfsProvidedStorage.html">Provided Storage</a>
</li>
</ul>
<h5>MapReduce</h5>
<ul>
<li class="none">
<a href="../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html">Tutorial</a>
</li>
<li class="none">
<a href="../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html">Compatibility with 1.x</a>
</li>
<li class="none">
<a href="../hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html">Encrypted Shuffle</a>
</li>
<li class="none">
<a href="../hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html">Pluggable Shuffle/Sort</a>
</li>
<li class="none">
<a href="../hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html">Distributed Cache Deploy</a>
</li>
<li class="none">
<a href="../hadoop-mapreduce-client/hadoop-mapreduce-client-core/SharedCacheSupport.html">Support for YARN Shared Cache</a>
</li>
</ul>
<h5>MapReduce REST APIs</h5>
<ul>
<li class="none">
<a href="../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredAppMasterRest.html">MR Application Master</a>
</li>
<li class="none">
<a href="../hadoop-mapreduce-client/hadoop-mapreduce-client-hs/HistoryServerRest.html">MR History Server</a>
</li>
</ul>
<h5>YARN</h5>
<ul>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/YARN.html">Architecture</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/YarnCommands.html">Commands Reference</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html">Capacity Scheduler</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/FairScheduler.html">Fair Scheduler</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html">ResourceManager Restart</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html">ResourceManager HA</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/ResourceModel.html">Resource Model</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/NodeLabel.html">Node Labels</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/NodeAttributes.html">Node Attributes</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html">Web Application Proxy</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/TimelineServer.html">Timeline Server</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html">Timeline Service V.2</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html">Writing YARN Applications</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/YarnApplicationSecurity.html">YARN Application Security</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/NodeManager.html">NodeManager</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/DockerContainers.html">Running Applications in Docker Containers</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/RuncContainers.html">Running Applications in runC Containers</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/NodeManagerCgroups.html">Using CGroups</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/SecureContainer.html">Secure Containers</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/ReservationSystem.html">Reservation System</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/GracefulDecommission.html">Graceful Decommission</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/OpportunisticContainers.html">Opportunistic Containers</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/Federation.html">YARN Federation</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/SharedCache.html">Shared Cache</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/UsingGpus.html">Using GPU</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/UsingFPGA.html">Using FPGA</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/PlacementConstraints.html">Placement Constraints</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/YarnUI2.html">YARN UI2</a>
</li>
</ul>
<h5>YARN REST APIs</h5>
<ul>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/WebServicesIntro.html">Introduction</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html">Resource Manager</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/NodeManagerRest.html">Node Manager</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/TimelineServer.html#Timeline_Server_REST_API_v1">Timeline Server</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/TimelineServiceV2.html#Timeline_Service_v.2_REST_API">Timeline Service V.2</a>
</li>
</ul>
<h5>YARN Service</h5>
<ul>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/yarn-service/Overview.html">Overview</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/yarn-service/QuickStart.html">QuickStart</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/yarn-service/Concepts.html">Concepts</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/yarn-service/YarnServiceAPI.html">Yarn Service API</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/yarn-service/ServiceDiscovery.html">Service Discovery</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-site/yarn-service/SystemServices.html">System Services</a>
</li>
</ul>
<h5>Hadoop Compatible File Systems</h5>
<ul>
<li class="none">
<a href="../hadoop-aliyun/tools/hadoop-aliyun/index.html">Aliyun OSS</a>
</li>
<li class="none">
<a href="../hadoop-aws/tools/hadoop-aws/index.html">Amazon S3</a>
</li>
<li class="none">
<a href="../hadoop-azure/index.html">Azure Blob Storage</a>
</li>
<li class="none">
<a href="../hadoop-azure-datalake/index.html">Azure Data Lake Storage</a>
</li>
<li class="none">
<a href="../hadoop-cos/cloud-storage/index.html">Tencent COS</a>
</li>
<li class="none">
<a href="../hadoop-huaweicloud/cloud-storage/index.html">Huaweicloud OBS</a>
</li>
</ul>
<h5>Auth</h5>
<ul>
<li class="none">
<a href="../hadoop-auth/index.html">Overview</a>
</li>
<li class="none">
<a href="../hadoop-auth/Examples.html">Examples</a>
</li>
<li class="none">
<a href="../hadoop-auth/Configuration.html">Configuration</a>
</li>
<li class="none">
<a href="../hadoop-auth/BuildingIt.html">Building</a>
</li>
</ul>
<h5>Tools</h5>
<ul>
<li class="none">
<a href="../hadoop-streaming/HadoopStreaming.html">Hadoop Streaming</a>
</li>
<li class="none">
<a href="../hadoop-archives/HadoopArchives.html">Hadoop Archives</a>
</li>
<li class="none">
<a href="../hadoop-archive-logs/HadoopArchiveLogs.html">Hadoop Archive Logs</a>
</li>
<li class="none">
<a href="../hadoop-distcp/DistCp.html">DistCp</a>
</li>
<li class="none">
<a href="../hadoop-federation-balance/HDFSFederationBalance.html">HDFS Federation Balance</a>
</li>
<li class="none">
<a href="../hadoop-gridmix/GridMix.html">GridMix</a>
</li>
<li class="none">
<a href="../hadoop-rumen/Rumen.html">Rumen</a>
</li>
<li class="none">
<a href="../hadoop-resourceestimator/ResourceEstimator.html">Resource Estimator Service</a>
</li>
<li class="none">
<a href="../hadoop-sls/SchedulerLoadSimulator.html">Scheduler Load Simulator</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/Benchmarking.html">Hadoop Benchmarking</a>
</li>
<li class="none">
<a href="../hadoop-dynamometer/Dynamometer.html">Dynamometer</a>
</li>
</ul>
<h5>Reference</h5>
<ul>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/release/">Changelog and Release Notes</a>
</li>
<li class="none">
<a href="../api/index.html">Java API docs</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/UnixShellAPI.html">Unix Shell API</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/Metrics.html">Metrics</a>
</li>
</ul>
<h5>Configuration</h5>
<ul>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/core-default.xml">core-default.xml</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs/hdfs-default.xml">hdfs-default.xml</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-hdfs-rbf/hdfs-rbf-default.xml">hdfs-rbf-default.xml</a>
</li>
<li class="none">
<a href="../hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml">mapred-default.xml</a>
</li>
<li class="none">
<a href="../hadoop-yarn/hadoop-yarn-common/yarn-default.xml">yarn-default.xml</a>
</li>
<li class="none">
<a href="../hadoop-kms/kms-default.html">kms-default.xml</a>
</li>
<li class="none">
<a href="../hadoop-hdfs-httpfs/httpfs-default.html">httpfs-default.xml</a>
</li>
<li class="none">
<a href="../hadoop-project-dist/hadoop-common/DeprecatedProperties.html">Deprecated Properties</a>
</li>
</ul>
<a href="http://maven.apache.org/" title="Built by Maven" class="poweredBy">
<img alt="Built by Maven" src="./images/logos/maven-feather.png"/>
</a>
</div>
</div>
<div id="bodyColumn">
<div id="contentBox">
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<h1>Gridmix</h1><hr />
<ul>
<li><a href="#Overview">Overview</a></li>
<li><a href="#Usage">Usage</a></li>
<li><a href="#General_Configuration_Parameters">General Configuration Parameters</a></li>
<li><a href="#Job_Types">Job Types</a></li>
<li><a href="#Job_Submission_Policies">Job Submission Policies</a></li>
<li><a href="#Emulating_Users_and_Queues">Emulating Users and Queues</a></li>
<li><a href="#Emulating_Distributed_Cache_Load">Emulating Distributed Cache Load</a></li>
<li><a href="#Configuration_of_Simulated_Jobs">Configuration of Simulated Jobs</a></li>
<li><a href="#Emulating_CompressionDecompression">Emulating Compression/Decompression</a></li>
<li><a href="#Emulating_High-Ram_jobs">Emulating High-Ram jobs</a></li>
<li><a href="#Emulating_resource_usages">Emulating resource usages</a></li>
<li><a href="#Simplifying_Assumptions">Simplifying Assumptions</a></li>
<li><a href="#Appendix">Appendix</a></li>
</ul><hr /><section>
<h2><a name="Overview"></a>Overview</h2>
<p>GridMix is a benchmark for Hadoop clusters. It submits a mix of synthetic jobs, modeling a profile mined from production loads. This version of the tool attempts to model the resource profiles of production jobs in order to identify bottlenecks and guide development.</p>
<p>To run GridMix, you need a MapReduce job trace describing the job mix for a given cluster. Such traces are typically generated by <a href="../hadoop-rumen/Rumen.html">Rumen</a>. GridMix also requires input data from which the synthetic jobs will be reading bytes. The input data need not be in any particular format, as the synthetic jobs are currently binary readers. If you are running on a new cluster, an optional step generating input data may precede the run. In order to emulate the load of production jobs from a given cluster on the same or another cluster, follow these steps:</p>
<ol style="list-style-type: decimal">
<li>
<p>Locate the job history files on the production cluster. This location is specified by the <code>mapreduce.jobhistory.done-dir</code> or <code>mapreduce.jobhistory.intermediate-done-dir</code> configuration property of the cluster. (<a href="../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredCommands.html#historyserver">MapReduce historyserver</a> moves job history files from <code>mapreduce.jobhistory.intermediate-done-dir</code> to <code>mapreduce.jobhistory.done-dir</code>.)</p>
</li>
<li>
<p>Run <a href="../hadoop-rumen/Rumen.html">Rumen</a> to build a job trace in JSON format for all or select jobs.</p>
</li>
<li>
<p>Use GridMix with the job trace on the benchmark cluster.</p>
</li>
</ol>
<p>Jobs submitted by GridMix have names of the form &#x201c;<code>GRIDMIXnnnnnn</code>&#x201d;, where &#x201c;<code>nnnnnn</code>&#x201d; is a sequence number padded with leading zeroes.</p></section><section>
<h2><a name="Usage"></a>Usage</h2>
<p>GridMix is provided as a <code>hadoop</code> subcommand. Basic command-line usage without configuration parameters:</p>
<div class="source">
<div class="source">
<pre>$ hadoop gridmix [-generate &lt;size&gt;] [-users &lt;users-list&gt;] &lt;iopath&gt; &lt;trace&gt;
</pre></div></div>
<p>Basic command-line usage with configuration parameters:</p>
<div class="source">
<div class="source">
<pre>$ hadoop gridmix \
-Dgridmix.client.submit.threads=10 -Dgridmix.output.directory=foo \
[-generate &lt;size&gt;] [-users &lt;users-list&gt;] &lt;iopath&gt; &lt;trace&gt;
</pre></div></div>
<blockquote>
<p>Configuration parameters like <code>-Dgridmix.client.submit.threads=10</code> and <code>-Dgridmix.output.directory=foo</code> as given above should be used <i>before</i> other GridMix parameters.</p>
</blockquote>
<p>The <code>&lt;iopath&gt;</code> parameter is the working directory for GridMix. Note that this can either be on the local file-system or on HDFS, but it is highly recommended that it be the same as that for the original job mix so that GridMix puts the same load on the local file-system and HDFS respectively.</p>
<p>The <code>-generate</code> option is used to generate input data and Distributed Cache files for the synthetic jobs. It accepts standard size-unit suffixes, e.g. <code>100g</code> will generate 100 * 2<sup>30</sup> bytes of input data. The minimum size of input data in compressed format (128 MiB by default) is defined by <code>gridmix.min.file.size</code>. <code>&lt;iopath&gt;/input</code> is the destination directory for generated input data and/or the directory from which input data will be read. HDFS-based Distributed Cache files are generated under the distributed cache directory <code>&lt;iopath&gt;/distributedCache</code>. If some of the needed Distributed Cache files already exist in the distributed cache directory, only the missing files are generated when the <code>-generate</code> option is specified.</p>
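<p>As a minimal sketch of the <code>-generate</code> option, the following invocation would generate 100 GiB of input data before running the trace. The working directory <code>/user/alice/gridmix-work</code> and the trace path <code>/user/alice/traces/job-trace.json</code> are hypothetical and chosen only for illustration:</p>
<div class="source">
<div class="source">
<pre># Hypothetical paths; replace them with your own working directory and trace.
# Generated input data would be written under /user/alice/gridmix-work/input.
$ hadoop gridmix -generate 100g /user/alice/gridmix-work /user/alice/traces/job-trace.json
</pre></div></div>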
<p>The <code>-users</code> option is used to point to a users-list file (see <a href="#usersqueues">Emulating Users and Queues</a>).</p>
<p>The <code>&lt;trace&gt;</code> parameter is a path to a job trace generated by Rumen. This trace can be compressed (it must be readable using one of the compression codecs supported by the cluster) or uncompressed. Use &#x201c;-&#x201d; as the value of this parameter if you want to pass an <i>uncompressed</i> trace via the standard input-stream of GridMix.</p>
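<p>For instance, a trace can be decompressed locally and piped to GridMix through standard input. This is only a sketch: the file name <code>job-trace.json.gz</code> and the working directory are hypothetical, and <code>gzip -dc</code> is used merely as a common way to write decompressed data to standard output:</p>
<div class="source">
<div class="source">
<pre># Hypothetical file and directory names.
# "-" tells GridMix to read the uncompressed trace from standard input.
$ gzip -dc job-trace.json.gz | hadoop gridmix /user/alice/gridmix-work -
</pre></div></div>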
<p>The supported configuration parameters are explained in the following sections.</p></section><section>
<h2><a name="General_Configuration_Parameters"></a>General Configuration Parameters</h2>
<table border="0" class="bodyTable">
<tr class="a">
<th>Parameter</th>
<th>Description</th>
</tr>
<tr class="b">
<td>
<code>gridmix.output.directory</code>
</td>
<td>The directory into which output will be written. If specified,
<code>iopath</code> will be relative to this parameter. The
submitting user must have read/write access to this directory. The
user should also be mindful of any quota issues that may arise
during a run. The default is &quot;<code>gridmix</code>&quot;.</td>
</tr>
<tr class="a">
<td>
<code>gridmix.client.submit.threads</code>
</td>
<td>The number of threads submitting jobs to the cluster. This
also controls how many splits will be loaded into memory at a given
time, pending the submit time in the trace. Splits are pre-generated
to hit submission deadlines, so particularly dense traces may want
more submitting threads. However, storing splits in memory is
reasonably expensive, so you should raise this cautiously. The
default is 1 for the SERIAL job-submission policy (see
<a href="#policies">Job Submission Policies</a>) and one more than
the number of processors on the client machine for the other
policies.</td>
</tr>
<tr class="b">
<td>
<code>gridmix.submit.multiplier</code>
</td>
<td>The multiplier to accelerate or decelerate the submission of
jobs. The time separating two jobs is multiplied by this factor.
The default value is 1.0. This is a crude mechanism to size
a job trace to a cluster.</td>
</tr>
<tr class="a">
<td>
<code>gridmix.client.pending.queue.depth</code>
</td>
<td>The depth of the queue of job descriptions awaiting split
generation. The jobs read from the trace occupy a queue of this
depth before being processed by the submission threads. It is
unusual to configure this. The default is 5.</td>
</tr>
<tr class="b">
<td>
<code>gridmix.gen.blocksize</code>
</td>
<td>The block-size of generated data. The default value is 256
MiB.</td>
</tr>
<tr class="a">
<td>
<code>gridmix.gen.bytes.per.file</code>
</td>
<td>The maximum bytes written per file. The default value is 1
GiB.</td>
</tr>
<tr class="b">
<td>
<code>gridmix.min.file.size</code>
</td>
<td>The minimum size of the input files. The default limit is 128
MiB. Tweak this parameter if you see an error-message like
&quot;Found no satisfactory file&quot; while testing GridMix with
a relatively-small input data-set.</td>
</tr>
<tr class="a">
<td>
<code>gridmix.max.total.scan</code>
</td>
<td>The maximum size of the input files. The default limit is 100
TiB.</td>
</tr>
<tr class="b">
<td>
<code>gridmix.task.jvm-options.enable</code>
</td>
<td>Enables GridMix to configure the simulated task's max heap
options using the values obtained from the original task (i.e. via
the trace).
</td>
</tr>
</table>
</section><section>
<h2><a name="Job_Types"></a>Job Types</h2>
<p>GridMix takes as input a job trace, essentially a stream of JSON-encoded job descriptions. For each job description, the submission client obtains the original job submission time and, for each task in that job, the byte and record counts read and written. Given this data, it constructs a synthetic job with the same byte and record patterns as recorded in the trace. It constructs jobs of two types:</p>
<table border="0" class="bodyTable">
<tr class="a">
<th>Job Type</th>
<th>Description</th>
</tr>
<tr class="b">
<td>
<code>LOADJOB</code>
</td>
<td>A synthetic job that emulates the workload described in the Rumen
trace. The current version supports I/O emulation: it reproduces
the I/O workload on the benchmark cluster by embedding
the detailed I/O information for every map and reduce task, such as
the number of bytes and records read and written, into each
job's input splits. The map tasks further relay the I/O patterns of
reduce tasks through the intermediate map output data.</td>
</tr>
<tr class="a">
<td>
<code>SLEEPJOB</code>
</td>
<td>A synthetic job where each task does <i>nothing</i> but sleep
for a certain duration as observed in the production trace. The
scalability of the ResourceManager is often limited by how many
heartbeats it can handle every second. (Heartbeats are periodic
messages sent from NodeManagers to update their status and grab new
tasks from the ResourceManager.) Since a benchmark cluster is typically
a fraction of the size of a production cluster, the heartbeat traffic
generated by the slave nodes is well below the level of the
production cluster. One possible solution is to run multiple
NodeManagers on each slave node. This leads to the obvious problem that
the I/O workload generated by the synthetic jobs would thrash the
slave nodes. Hence the need for such a job.</td>
</tr>
</table>
<p>The following configuration parameters affect the job type:</p>
<table border="0" class="bodyTable">
<tr class="a">
<th>Parameter</th>
<th>Description</th>
</tr>
<tr class="b">
<td>
<code>gridmix.job.type</code>
</td>
<td>The value for this key can be one of LOADJOB or SLEEPJOB. The
default value is LOADJOB.</td>
</tr>
<tr class="a">
<td>
<code>gridmix.key.fraction</code>
</td>
<td>For a LOADJOB type of job, the fraction of a record used for
the data for the key. The default value is 0.1.</td>
</tr>
<tr class="b">
<td>
<code>gridmix.sleep.maptask-only</code>
</td>
<td>For a SLEEPJOB type of job, whether to ignore the reduce
tasks for the job. The default is <code>false</code>.</td>
</tr>
<tr class="a">
<td>
<code>gridmix.sleep.fake-locations</code>
</td>
<td>For a SLEEPJOB type of job, the number of fake locations
for map tasks for the job. The default is 0.</td>
</tr>
<tr class="b">
<td>
<code>gridmix.sleep.max-map-time</code>
</td>
<td>For a SLEEPJOB type of job, the maximum runtime for map
tasks for the job in milliseconds. The default is unlimited.</td>
</tr>
<tr class="a">
<td>
<code>gridmix.sleep.max-reduce-time</code>
</td>
<td>For a SLEEPJOB type of job, the maximum runtime for reduce
tasks for the job in milliseconds. The default is unlimited.</td>
</tr>
</table>
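<p>As an illustrative sketch, the parameters above could be combined to run map-only sleep jobs whose map tasks are capped at 60 seconds. The working directory and trace path are hypothetical; the property names are the ones listed in the table above:</p>
<div class="source">
<div class="source">
<pre># Hypothetical paths. 60000 is a duration in milliseconds (60 seconds).
$ hadoop gridmix \
-Dgridmix.job.type=SLEEPJOB \
-Dgridmix.sleep.maptask-only=true \
-Dgridmix.sleep.max-map-time=60000 \
/user/alice/gridmix-work /user/alice/traces/job-trace.json
</pre></div></div>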
<p><a name="policies"></a></p></section><section>
<h2><a name="Job_Submission_Policies"></a>Job Submission Policies</h2>
<p>GridMix controls the rate of job submission. This control can be based on the trace information or on statistics it gathers from the ResourceManager. GridMix uses the algorithm that corresponds to the configured submission policy. There are currently three types of policies:</p>
<table border="0" class="bodyTable">
<tr class="a">
<th>Job Submission Policy</th>
<th>Description</th>
</tr>
<tr class="b">
<td>
<code>STRESS</code>
</td>
<td>Keep submitting jobs so that the cluster remains under stress.
In this mode we control the rate of job submission by monitoring
the real-time load of the cluster so that we can maintain a stable
stress level of workload on the cluster. Based on the statistics we
gather, we determine whether a cluster is <i>underloaded</i> or
<i>overloaded</i>. We consider a cluster <i>underloaded</i> if
and only if the following three conditions are true:
<ol style="list-style-type: decimal">
<li>the number of pending and running jobs is under a threshold
TJ</li>
<li>the number of pending and running maps is under a threshold
TM</li>
<li>the number of pending and running reduces is under a threshold
TR</li>
</ol>
The thresholds TJ, TM and TR are proportional to the size of the
cluster and to its map and reduce slot capacities respectively. When a
cluster is <i>overloaded</i>, we throttle the job submission.
In the actual calculation we also weigh each running task by its
remaining work: a 90% complete task is counted as only 0.1
in the calculation. Finally, to avoid a very large job blocking other
jobs, we limit the number of pending/waiting tasks each job can
contribute.</td>
</tr>
<tr class="a">
<td>
<code>REPLAY</code>
</td>
<td>In this mode we replay the job traces faithfully. This mode
exactly follows the time-intervals given in the actual job
trace.</td>
</tr>
<tr class="b">
<td>
<code>SERIAL</code>
</td>
<td>In this mode we submit the next job only after the previously
submitted job has completed.</td>
</tr>
</table>
<p>The following configuration parameters affect the job submission policy:</p>
<table border="0" class="bodyTable">
<tr class="a">
<th>Parameter</th>
<th>Description</th>
</tr>
<tr class="b">
<td>
<code>gridmix.job-submission.policy</code>
</td>
<td>The value for this key must be one of STRESS, REPLAY
or SERIAL. In most cases the value will be STRESS or
REPLAY. The default value is STRESS.</td>
</tr>
<tr class="a">
<td>
<code>gridmix.throttle.jobs-to-tracker-ratio</code>
</td>
<td>In STRESS mode, the minimum ratio of running jobs to
NodeManagers in a cluster for the cluster to be considered
<i>overloaded</i>. This is the threshold TJ referred to earlier.
The default is 1.0.</td>
</tr>
<tr class="b">
<td>
<code>gridmix.throttle.maps.task-to-slot-ratio</code>
</td>
<td>In STRESS mode, the minimum ratio of pending and running map
tasks (i.e. incomplete map tasks) to the number of map slots in
a cluster for the cluster to be considered <i>overloaded</i>.
This is the threshold TM referred to earlier. Running map tasks are
counted partially. For example, a 40% complete map task is counted
as 0.6 map tasks. The default is 2.0.</td>
</tr>
<tr class="a">
<td>
<code>gridmix.throttle.reduces.task-to-slot-ratio</code>
</td>
<td>In STRESS mode, the minimum ratio of pending and running reduce
tasks (i.e. incomplete reduce tasks) to the number of reduce slots
in a cluster for the cluster to be considered <i>overloaded</i>.
This is the threshold TR referred to earlier. Running reduce tasks
are counted partially. For example, a 30% complete reduce task is
counted as 0.7 reduce tasks. The default is 2.5.</td>
</tr>
<tr class="b">
<td>
<code>gridmix.throttle.maps.max-slot-share-per-job</code>
</td>
<td>In STRESS mode, the maximum share of a cluster's map-slots
capacity that can be counted toward a job's incomplete map tasks in
overload calculation. The default is 0.1.</td>
</tr>
<tr class="a">
<td>
<code>gridmix.throttle.reducess.max-slot-share-per-job</code>
</td>
<td>In STRESS mode, the maximum share of a cluster's reduce-slots
capacity that can be counted toward a job's incomplete reduce tasks
in overload calculation. The default is 0.1.</td>
</tr>
</table>
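<p>The underload test described for STRESS mode can be sketched roughly as follows. This is a simplified model built only from the description and defaults in the tables above, not the GridMix implementation: the function names and the dictionary layout used for a job are hypothetical, and a single share constant stands in for both max-slot-share parameters because they share the same default.</p>

```python
# Defaults from the parameter table above.
JOBS_TO_TRACKER_RATIO = 1.0        # gridmix.throttle.jobs-to-tracker-ratio
MAPS_TASK_TO_SLOT_RATIO = 2.0      # gridmix.throttle.maps.task-to-slot-ratio
REDUCES_TASK_TO_SLOT_RATIO = 2.5   # gridmix.throttle.reduces.task-to-slot-ratio
MAX_SLOT_SHARE_PER_JOB = 0.1       # default of both max-slot-share-per-job keys


def incomplete_load(pending, running_progress, slots,
                    max_share=MAX_SLOT_SHARE_PER_JOB):
    """Weighted incomplete tasks of one job, capped at its slot share.

    pending          -- tasks not yet started (each counts as 1.0)
    running_progress -- progress (0.0 to 1.0) of each running task
    slots            -- cluster slot capacity for this task type
    """
    # A running task only counts for the work it still has left.
    weighted = pending + sum(1.0 - p for p in running_progress)
    # A very large job may contribute at most max_share of the capacity.
    return min(weighted, max_share * slots)


def is_underloaded(jobs, node_managers, map_slots, reduce_slots):
    """True only if jobs, maps and reduces are all under their thresholds."""
    tj = JOBS_TO_TRACKER_RATIO * node_managers
    tm = MAPS_TASK_TO_SLOT_RATIO * map_slots
    tr = REDUCES_TASK_TO_SLOT_RATIO * reduce_slots
    maps = sum(incomplete_load(j["pending_maps"], j["map_progress"], map_slots)
               for j in jobs)
    reduces = sum(incomplete_load(j["pending_reduces"], j["reduce_progress"],
                                  reduce_slots)
                  for j in jobs)
    return tj > len(jobs) and tm > maps and tr > reduces


# Hypothetical cluster: 10 NodeManagers, 40 map slots, 20 reduce slots.
job = {"pending_maps": 100, "map_progress": [0.9, 0.4],
       "pending_reduces": 5, "reduce_progress": [0.3]}
print(is_underloaded([job], 10, 40, 20))       # True
print(is_underloaded([job] * 12, 10, 40, 20))  # False
```

<p>In the first call the single job has about 100.7 weighted incomplete maps, but the per-job cap limits its contribution to 4.0 (10% of 40 map slots), so the cluster still counts as underloaded. In the second call twelve jobs exceed the job threshold TJ of 10, so submission would be throttled.</p>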
<p><a name="usersqueues"></a></p></section><section>
<h2><a name="Emulating_Users_and_Queues"></a>Emulating Users and Queues</h2>
<p>Typical production clusters are often shared among different users, and the cluster capacity is divided among different departments through job queues. Ensuring fairness among jobs from all users, honoring queue capacity allocation policies and preventing an ill-behaved job from taking over the cluster add significant complexity to the Hadoop software. To be able to sufficiently test and discover bugs in these areas, GridMix must emulate the contention among jobs from different users and/or submitted to different queues.</p>
<p>Emulating multiple queues is easy - we simply set up the benchmark cluster with the same queue configuration as the production cluster and we configure synthetic jobs so that they get submitted to the same queue as recorded in the trace. However, not all users shown in the trace have accounts on the benchmark cluster. Instead, we set up a number of testing user accounts and associate each unique user in the trace to testing users in a round-robin fashion.</p>
<p>The following configuration parameters affect the emulation of users and queues:</p>
<table border="0" class="bodyTable">
<tr class="a">
<th>Parameter</th>
<th>Description</th>
</tr>
<tr class="b">
<td>
<code>gridmix.job-submission.use-queue-in-trace</code>
</td>
<td>When set to <code>true</code>, GridMix uses exactly the same set
of queues as those mentioned in the trace. The default value is
<code>false</code>.</td>
</tr>
<tr class="a">
<td>
<code>gridmix.job-submission.default-queue</code>
</td>
<td>Specifies the default queue to which all jobs are
submitted. If this parameter is not specified, GridMix uses the
default queue defined for the submitting user on the cluster.</td>
</tr>
<tr class="b">
<td>
<code>gridmix.user.resolve.class</code>
</td>
<td>Specifies which <code>UserResolver</code> implementation to use.
We currently have three implementations:
<ol style="list-style-type: decimal">
<li><code>org.apache.hadoop.mapred.gridmix.EchoUserResolver</code>
- submits a job as the user who submitted the original job. All
the users of the production cluster identified in the job trace
must also have accounts on the benchmark cluster in this case.</li>
<li><code>org.apache.hadoop.mapred.gridmix.SubmitterUserResolver</code>
- submits all the jobs as the current GridMix user. In this case we
simply map all the users in the trace to the current GridMix user
and submit the job.</li>
<li><code>org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver</code>
- maps trace users to test users in a round-robin fashion. In
this case we set up a number of testing user accounts and
associate each unique user in the trace to testing users in a
round-robin fashion.</li>
</ol>
The default is
<code>org.apache.hadoop.mapred.gridmix.SubmitterUserResolver</code>.</td>
</tr>
</table>
<p>If the parameter <code>gridmix.user.resolve.class</code> is set to <code>org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver</code>, we need to define a users-list file with a list of test users. This is specified using the <code>-users</code> option to GridMix.</p>
<p>Specifying a users-list file using the <code>-users</code> option is mandatory when using the round-robin user-resolver. Other user-resolvers ignore this option.</p>
<p>A users-list file has one user per line, each line of the format:</p>
<div class="source">
<div class="source">
<pre>&lt;username&gt;
</pre></div></div>
<p>For example:</p>
<div class="source">
<div class="source">
<pre>user1
user2
user3
</pre></div></div>
<p>In the above example we have defined three users <code>user1</code>, <code>user2</code> and <code>user3</code>. Each unique user in the trace is then associated with one of these users in round-robin fashion. For example, if the trace&#x2019;s users are <code>tuser1</code>, <code>tuser2</code>, <code>tuser3</code>, <code>tuser4</code> and <code>tuser5</code>, then the mappings would be:</p>
<div class="source">
<div class="source">
<pre>tuser1 -&gt; user1
tuser2 -&gt; user2
tuser3 -&gt; user3
tuser4 -&gt; user1
tuser5 -&gt; user2
</pre></div></div>
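<p>The round-robin association shown above can be sketched in a few lines. This is an illustrative model only, not the <code>RoundRobinUserResolver</code> source: the function names are hypothetical, and it assumes that trace users are assigned in the order in which they first appear. Anything after a comma on a users-list line is dropped, since only the username is used.</p>

```python
def parse_users_list(lines):
    """Return the test user names found in users-list lines.

    Text after the first comma on a line is dropped, and blank lines
    are skipped.
    """
    users = []
    for line in lines:
        name = line.split(",", 1)[0].strip()
        if name:
            users.append(name)
    return users


def round_robin_mapping(trace_users, test_users):
    """Assign each unique trace user a test user, cycling through the list."""
    if not test_users:
        raise ValueError("the round-robin resolver needs at least one test user")
    mapping = {}
    for user in trace_users:
        if user not in mapping:  # only the first sighting picks a test user
            mapping[user] = test_users[len(mapping) % len(test_users)]
    return mapping


test_users = parse_users_list(["user1", "user2,groupA", "user3"])
trace_users = ["tuser1", "tuser2", "tuser1", "tuser3", "tuser4", "tuser5"]
for trace_user, test_user in round_robin_mapping(trace_users, test_users).items():
    print(trace_user, "->", test_user)
```

<p>Run on the example users, the sketch reproduces the five mappings listed above; the repeated <code>tuser1</code> entry keeps its first assignment.</p>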
<p>For backward compatibility reasons, each line of the users-list file can contain a username followed by group names in the form <code>username[,group]*</code>. The group names are ignored by GridMix.</p></section><section>
<h2><a name="Emulating_Distributed_Cache_Load"></a>Emulating Distributed Cache Load</h2>
<p>GridMix emulates Distributed Cache load by default for jobs of type LOADJOB. This is done by precreating the needed Distributed Cache files for all the simulated jobs as part of a separate MapReduce job.</p>
<p>Emulation of Distributed Cache load in GridMix simulated jobs can be disabled by setting the property <code>gridmix.distributed-cache-emulation.enable</code> to <code>false</code>. However, generation of Distributed Cache data by GridMix is driven by the <code>-generate</code> option and is independent of this configuration property.</p>
<p>Both generation of Distributed Cache files and emulation of Distributed Cache load are disabled if:</p>
<ul>
<li>the input trace comes from the standard input stream instead of a file, or</li>
<li>the <code>&lt;iopath&gt;</code> specified is on the local file system, or</li>
<li>any of the ancestor directories of the distributed cache directory, i.e. <code>&lt;iopath&gt;/distributedCache</code> (including the distributed cache directory itself), doesn&#x2019;t have execute permission for others.</li>
</ul></section><section>
<h2><a name="Configuration_of_Simulated_Jobs"></a>Configuration of Simulated Jobs</h2>
<p>GridMix3 sets some configuration properties in the simulated jobs it submits so that they can be mapped back to the corresponding job in the input job trace. These configuration parameters include:</p>
<table border="0" class="bodyTable">
<tr class="a">
<th>Parameter</th>
<th>Description</th>
</tr>
<tr class="b">
<td>
<code>gridmix.job.original-job-id</code>
</td>
<td> The job id of the original cluster's job corresponding to this
simulated job.
</td>
</tr>
<tr class="a">
<td>
<code>gridmix.job.original-job-name</code>
</td>
<td> The job name of the original cluster's job corresponding to this
simulated job.
</td>
</tr>
</table>
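<p>One way these two properties might be used after a run is to build a lookup from original job ids to simulated jobs. The sketch below assumes the configuration of each simulated job has already been loaded into a plain dictionary; how that is done, the function name and the job ids in the example are all hypothetical.</p>

```python
ORIGINAL_JOB_ID = "gridmix.job.original-job-id"
ORIGINAL_JOB_NAME = "gridmix.job.original-job-name"


def index_by_original_job(simulated_job_confs):
    """Map each original job id to (simulated job id, original job name).

    simulated_job_confs -- dict of simulated job id to that job's
    configuration properties (a plain dict of strings in this sketch).
    Jobs lacking the original-job-id property are skipped.
    """
    index = {}
    for sim_id, conf in simulated_job_confs.items():
        original_id = conf.get(ORIGINAL_JOB_ID)
        if original_id is not None:
            index[original_id] = (sim_id, conf.get(ORIGINAL_JOB_NAME))
    return index


# Made-up job ids and names, for illustration only.
confs = {
    "job_sim_0001": {ORIGINAL_JOB_ID: "job_orig_0042",
                     ORIGINAL_JOB_NAME: "nightly-report"},
    "job_sim_0002": {},  # e.g. a job that was not created from the trace
}
print(index_by_original_job(confs))
```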
</section><section>
<h2><a name="Emulating_Compression.2FDecompression"></a>Emulating Compression/Decompression</h2>
<p>MapReduce supports data compression and decompression. Input to a MapReduce job can be compressed. Similarly, output of Map and Reduce tasks can also be compressed. Emulating compression/decompression in GridMix is important because it affects the CPU and memory usage of the task. A task emulating compression/decompression will also affect other tasks and daemons running on the same node.</p>
<p>Compression emulation is enabled if <code>gridmix.compression-emulation.enable</code> is set to <code>true</code>. By default compression emulation is enabled for jobs of type <i>LOADJOB</i>. With compression emulation enabled, GridMix generates compressed text data with a constant compression ratio. A simulated GridMix job therefore emulates compression/decompression using compressible text data (having a constant compression ratio), irrespective of the compression ratio observed in the actual job.</p>
<p>A typical MapReduce job deals with data compression/decompression in the following phases:</p>
<ul>
<li>
<p><code>Job input data decompression:</code> GridMix generates compressible input data when compression emulation is enabled. Based on the original job&#x2019;s configuration, a simulated GridMix job will use a decompressor to read the compressed input data. Currently, GridMix uses <code>mapreduce.input.fileinputformat.inputdir</code> to determine if the original job used compressed input data or not. If the original job&#x2019;s input files are uncompressed then the simulated job will read the compressed input file without using a decompressor.</p>
</li>
<li>
<p><code>Intermediate data compression and decompression:</code> If the original job has map output compression enabled then GridMix too will enable map output compression for the simulated job. Accordingly, the reducers will use a decompressor to read the map output data.</p>
</li>
<li>
<p><code>Job output data compression:</code> If the original job&#x2019;s output is compressed then GridMix too will enable job output compression for the simulated job.</p>
</li>
</ul>
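<p>The three phases above boil down to copying the original job's compression choices onto the simulated job. A minimal sketch of that decision, assuming the original job's settings have already been reduced to three booleans (the function and key names are hypothetical and are not GridMix APIs):</p>

```python
def compression_plan(emulation_enabled, original_input_compressed,
                     original_map_output_compressed,
                     original_output_compressed):
    """Decide which compression steps a simulated job would perform."""
    if not emulation_enabled:
        # gridmix.compression-emulation.enable=false: nothing is emulated.
        return {"decompress_input": False,
                "compress_map_output": False,
                "compress_job_output": False}
    return {
        # Input is read through a decompressor only if the original job
        # used compressed input.
        "decompress_input": original_input_compressed,
        # Reducers decompress map output whenever maps compress it.
        "compress_map_output": original_map_output_compressed,
        "compress_job_output": original_output_compressed,
    }


# Original job: compressed input, plain map output, compressed job output.
print(compression_plan(True, True, False, True))
```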
<p>The following configuration parameters affect compression emulation:</p>
<table border="0" class="bodyTable">
<tr class="a">
<th>Parameter</th>
<th>Description</th>
</tr>
<tr class="b">
<td>
<code>gridmix.compression-emulation.enable</code>
</td>
<td>Enables compression emulation in simulated GridMix jobs.
Default is true.</td>
</tr>
</table>
<p>With compression emulation turned on, GridMix will generate compressed input data. Hence the total size of the input data will be smaller than the expected size. Set <code>gridmix.min.file.size</code> to a smaller value (roughly 10% of <code>gridmix.gen.bytes.per.file</code>) so that GridMix can emulate compression correctly.</p></section><section>
<h2><a name="Emulating_High-Ram_jobs"></a>Emulating High-Ram jobs</h2>
<p>MapReduce allows users to define a job as a High-Ram job. Tasks from a High-Ram job can occupy a larger fraction of memory in task processes. Emulating this behavior is important for the following reasons.</p>
<ul>
<li>
<p>Impact on the scheduler: Scheduling of tasks from High-Ram jobs impacts the scheduling behavior as it might result in resource reservation and utilization.</p>
</li>
<li>
<p>Impact on the node: Since High-Ram tasks occupy more memory, NodeManagers do some bookkeeping for allocating extra resources for these tasks. Thus this becomes a precursor for memory emulation, where tasks with high memory requirements need to be considered as High-Ram tasks.</p>
</li>
</ul>
<p>High-Ram feature emulation can be disabled by setting<br />
<code>gridmix.highram-emulation.enable</code> to <code>false</code>.</p></section><section>
<h2><a name="Emulating_resource_usages"></a>Emulating resource usages</h2>
<p>Usages of resources like CPU, physical memory, virtual memory, JVM heap etc. are recorded by MapReduce using its task counters. This information is used by GridMix to emulate the resource usages in the simulated tasks. Emulating resource usages will help GridMix exert similar load on the test cluster as seen in the actual cluster.</p>
<p>MapReduce tasks use up resources during their entire lifetime. GridMix also tries to mimic this behavior by spanning resource usage emulation across the entire lifetime of the simulated task. Each resource to be emulated should have an <i>emulator</i> associated with it. Each such <i>emulator</i> should implement the <code>org.apache.hadoop.mapred.gridmix.emulators.resourceusage.ResourceUsageEmulatorPlugin</code> interface. Resource <i>emulators</i> in GridMix are <i>plugins</i> that can be configured (plugged in or out) before every run. GridMix users can configure multiple emulator <i>plugins</i> by passing a comma-separated list of <i>emulators</i> as the value of the <code>gridmix.emulators.resource-usage.plugins</code> parameter.</p>
<p>List of <i>emulators</i> shipped with GridMix:</p>
<ul>
<li>
<p>Cumulative CPU usage <i>emulator</i>: GridMix uses the cumulative CPU usage value published by Rumen and makes sure that the total cumulative CPU usage of the simulated task is close to the value published by Rumen. GridMix can be configured to emulate cumulative CPU usage by adding <code>org.apache.hadoop.mapred.gridmix.emulators.resourceusage.CumulativeCpuUsageEmulatorPlugin</code> to the list of emulator <i>plugins</i> configured for the <code>gridmix.emulators.resource-usage.plugins</code> parameter. The CPU usage emulator is designed so that it only emulates at specific progress boundaries of the task. This interval can be configured using <code>gridmix.emulators.resource-usage.cpu.emulation-interval</code>. The default value for this parameter is <code>0.1</code>, i.e. a <code>10%</code> progress interval.</p>
</li>
<li>
<p>Total heap usage <i>emulator</i>: GridMix uses the total heap usage value published by Rumen and makes sure that the total heap usage of the simulated task is close to the value published by Rumen. GridMix can be configured to emulate total heap usage by adding <code>org.apache.hadoop.mapred.gridmix.emulators.resourceusage.TotalHeapUsageEmulatorPlugin</code> to the list of emulator <i>plugins</i> configured for the <code>gridmix.emulators.resource-usage.plugins</code> parameter. The heap usage emulator is designed so that it only emulates at specific progress boundaries of the task. This interval can be configured using <code>gridmix.emulators.resource-usage.heap.emulation-interval</code>. The default value for this parameter is <code>0.1</code>, i.e. a <code>10%</code> progress interval.</p>
</li>
</ul>
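<p>To make the idea of emulating only at progress boundaries more concrete, the sketch below lists the boundaries implied by an emulation interval and the usage a task would need to have reached at each one. It is not taken from the emulator plugins: the function names are hypothetical, and it assumes, purely for illustration, that the target usage grows linearly with task progress. It also shows how the two shipped plugin class names could be joined into a value for <code>gridmix.emulators.resource-usage.plugins</code>.</p>

```python
PACKAGE = "org.apache.hadoop.mapred.gridmix.emulators.resourceusage"
CPU_PLUGIN = PACKAGE + ".CumulativeCpuUsageEmulatorPlugin"
HEAP_PLUGIN = PACKAGE + ".TotalHeapUsageEmulatorPlugin"


def plugins_value(plugins):
    """Comma separated value for gridmix.emulators.resource-usage.plugins."""
    return ",".join(plugins)


def progress_boundaries(interval=0.1):
    """Progress points (fractions up to 1.0) at which emulation happens."""
    if not interval > 0 or interval > 1:
        raise ValueError("interval must be greater than 0 and at most 1")
    boundaries = []
    step = 1
    # Rounding keeps float noise (0.30000000000000004) out of the result.
    while 1.0 > round(step * interval, 10):
        boundaries.append(round(step * interval, 10))
        step += 1
    boundaries.append(1.0)  # always finish at task completion
    return boundaries


def usage_targets(total_usage, interval=0.1):
    """Usage to reach by each boundary, assuming linear growth (assumption)."""
    return [total_usage * b for b in progress_boundaries(interval)]


print(plugins_value([CPU_PLUGIN, HEAP_PLUGIN]))
print(progress_boundaries(0.25))   # [0.25, 0.5, 0.75, 1.0]
print(usage_targets(5000, 0.25))   # [1250.0, 2500.0, 3750.0, 5000.0]
```

<p>With the default interval of <code>0.1</code> the same functions yield ten boundaries, from 10% to 100% progress. A smaller interval spreads the emulated usage more evenly over the task's lifetime at the cost of more frequent emulation steps.</p>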
<p>Note that GridMix will emulate resource usages only for jobs of type <i>LOADJOB</i> .</p></section><section>
<h2><a name="Simplifying_Assumptions"></a>Simplifying Assumptions</h2>
<p>GridMix will be developed in stages, incorporating feedback and patches from the community. Currently its intent is to evaluate MapReduce and HDFS performance and not the layers on top of them (i.e. the extensive lib and sub-project space). Given these two limitations, the following characteristics of job load are not currently captured in job traces and cannot be accurately reproduced in GridMix:</p>
<ul>
<li>
<p><i>Filesystem Properties</i> - No attempt is made to match block sizes, namespace hierarchies, or any property of input, intermediate or output data other than the bytes/records consumed and emitted from a given task. This implies that some of the most heavily-used parts of the system - text processing, streaming, etc. - cannot be meaningfully tested with the current implementation.</p>
</li>
<li>
<p><i>I/O Rates</i> - The rate at which records are consumed/emitted is assumed to be limited only by the speed of the reader/writer and constant throughout the task.</p>
</li>
<li>
<p><i>Memory Profile</i> - No data on tasks&#x2019; memory usage over time is available, though the max heap-size is retained.</p>
</li>
<li>
<p><i>Skew</i> - The records consumed and emitted to/from a given task are assumed to follow observed averages, i.e. records will be more regular than may be seen in the wild. Each map also generates a proportional percentage of data for each reduce, so a job with unbalanced input will be flattened.</p>
</li>
<li>
<p><i>Job Failure</i> - User code is assumed to be correct.</p>
</li>
<li>
<p><i>Job Independence</i> - The output or outcome of one job does not affect when or whether a subsequent job will run.</p>
</li>
</ul></section><section>
<h2><a name="Appendix"></a>Appendix</h2>
<p>There exist older versions of the GridMix tool. Issues tracking the original implementations of <a class="externalLink" href="https://issues.apache.org/jira/browse/HADOOP-2369">GridMix1</a>, <a class="externalLink" href="https://issues.apache.org/jira/browse/HADOOP-3770">GridMix2</a>, and <a class="externalLink" href="https://issues.apache.org/jira/browse/MAPREDUCE-776">GridMix3</a> can be found on the Apache Hadoop MapReduce JIRA. Other issues tracking the current development of GridMix can be found by searching <a class="externalLink" href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313086">the Apache Hadoop MapReduce JIRA</a>.</p></section>
</div>
</div>
<div class="clear">
<hr/>
</div>
<div id="footer">
<div class="xright">
&#169; 2008-2023
Apache Software Foundation
- <a href="http://maven.apache.org/privacy-policy.html">Privacy Policy</a>.
Apache Maven, Maven, Apache, the Apache feather logo, and the Apache Maven project logos are trademarks of The Apache Software Foundation.
</div>
<div class="clear">
<hr/>
</div>
</div>
</body>
</html>