<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--
| Generated by Apache Maven Doxia at 2023-02-14
| Rendered using Apache Maven Stylus Skin 1.5
-->
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Apache Hadoop Rumen &#x2013; Rumen</title>
<style type="text/css" media="all">
@import url("./css/maven-base.css");
@import url("./css/maven-theme.css");
@import url("./css/site.css");
</style>
<link rel="stylesheet" href="./css/print.css" type="text/css" media="print" />
<meta name="Date-Revision-yyyymmdd" content="20230214" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body class="composite">
<div id="banner">
<a href="http://hadoop.apache.org/" id="bannerLeft">
<img src="http://hadoop.apache.org/images/hadoop-logo.jpg" alt="" />
</a>
<a href="http://www.apache.org/" id="bannerRight">
<img src="http://www.apache.org/images/asf_logo_wide.png" alt="" />
</a>
<div class="clear">
<hr/>
</div>
</div>
<div id="breadcrumbs">
<div class="xright"> <a href="http://wiki.apache.org/hadoop" class="externalLink">Wiki</a>
|
<a href="https://gitbox.apache.org/repos/asf/hadoop.git" class="externalLink">git</a>
&nbsp;| Last Published: 2023-02-14
&nbsp;| Version: 3.4.0-SNAPSHOT
</div>
<div class="clear">
<hr/>
</div>
</div>
<div id="bodyColumn">
<div id="contentBox">
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<h1>Rumen</h1><hr />
<ul>
<li><a href="#Overview">Overview</a>
<ul>
<li><a href="#Motivation">Motivation</a></li>
<li><a href="#Components">Components</a></li>
</ul>
</li>
<li><a href="#How_to_use_Rumen.3F">How to use Rumen?</a>
<ul>
<li><a href="#Trace_Builder">Trace Builder</a></li>
<li><a href="#Folder">Folder</a></li>
</ul>
</li>
<li><a href="#Appendix">Appendix</a>
<ul>
<li><a href="#Resources">Resources</a></li>
<li><a href="#Dependencies">Dependencies</a></li>
</ul>
</li>
</ul><hr /><section>
<h2><a name="Overview"></a>Overview</h2>
<p><i>Rumen</i> is a data extraction and analysis tool built for <i>Apache Hadoop</i>. <i>Rumen</i> mines <i>JobHistory</i> logs to extract meaningful data and stores it in an easily-parsed, condensed format or <i>digest</i>. The raw trace data from MapReduce logs are often insufficient for simulation, emulation, and benchmarking, as these tools often attempt to measure conditions that did not occur in the source data. For example, if a task ran locally in the raw trace data but a simulation of the scheduler elects to run that task on a remote rack, the simulator requires a runtime its input cannot provide. To fill in these gaps, Rumen performs a statistical analysis of the digest to estimate the variables the trace doesn&#x2019;t supply. Rumen traces drive both Gridmix (a benchmark of Hadoop MapReduce clusters) and SLS (a simulator for the resource manager scheduler).</p><section>
<h3><a name="Motivation"></a>Motivation</h3>
<ul>
<li>
<p>Extracting meaningful data from <i>JobHistory</i> logs is a common task for any tool built to work on <i>MapReduce</i>. Writing a custom tool that is tightly coupled with the <i>MapReduce</i> framework is tedious. Hence there is a need for a built-in tool that performs the framework-level task of log parsing and analysis. Such a tool insulates external systems that depend on job history from changes made to the job history format.</p>
</li>
<li>
<p>Performing statistical analysis of various attributes of a <i>MapReduce Job</i>, such as <i>task runtimes, task failures, etc.</i>, is another common task that benchmarking and simulation tools might need. <i>Rumen</i> generates <a class="externalLink" href="http://en.wikipedia.org/wiki/Cumulative_distribution_function"> <i>Cumulative Distribution Functions (CDF)</i> </a> for the Map/Reduce task runtimes. A runtime CDF can be used to extrapolate the task runtime of incomplete, missing, and synthetic tasks. Similarly, a CDF is also computed for the total number of successful tasks for every attempt.</p>
</li>
</ul></section><section>
<h3><a name="Components"></a>Components</h3>
<p><i>Rumen</i> consists of 2 components:</p>
<ul>
<li>
<p><i>Trace Builder</i> : Converts <i>JobHistory</i> logs into an easily-parsed format. Currently <code>TraceBuilder</code> outputs the trace in <a class="externalLink" href="http://www.json.org/"><i>JSON</i></a> format.</p>
</li>
<li>
<p><i>Folder</i> : A utility to scale the input trace. A trace obtained from <i>TraceBuilder</i> simply summarizes the jobs in the input folders and files. The time-span within which all the jobs in a given trace finish can be considered as the trace runtime. <i>Folder</i> can be used to scale the runtime of a trace. Decreasing the trace runtime might involve dropping some jobs from the input trace and scaling down the runtime of remaining jobs. Increasing the trace runtime might involve adding some dummy jobs to the resulting trace and scaling up the runtime of individual jobs.</p>
</li>
</ul></section></section><section>
<h2><a name="How_to_use_Rumen.3F"></a>How to use Rumen?</h2>
<p>Converting <i>JobHistory</i> logs into a desired job-trace consists of 2 steps:</p>
<ol style="list-style-type: decimal">
<li>
<p>Extracting information into an intermediate format</p>
</li>
<li>
<p>Adjusting the job-trace obtained from the intermediate trace to have the desired properties.</p>
</li>
</ol>
<blockquote>
<p>Extracting information from <i>JobHistory</i> logs is a one-time operation. This so-called <i>Gold Trace</i> can be reused to generate traces with desired values of properties such as <code>output-duration</code>, <code>concentration</code>, etc.</p>
</blockquote>
<p><i>Rumen</i> provides 2 basic commands</p>
<ul>
<li><code>TraceBuilder</code></li>
<li><code>Folder</code></li>
</ul>
<p>The first step is to generate the <i>Gold Trace</i> by running <code>TraceBuilder</code> on a job-history folder. The output of <code>TraceBuilder</code> is a job-trace file (and an optional cluster-topology file). To scale the output, use the <code>Folder</code> utility to fold the current trace to the desired length. The rest of this section explains these utilities in detail.</p><section>
<h3><a name="Trace_Builder"></a>Trace Builder</h3><section>
<h4><a name="Command"></a>Command</h4>
<div class="source">
<div class="source">
<pre>hadoop rumentrace [options] &lt;jobtrace-output&gt; &lt;topology-output&gt; &lt;inputs&gt;
</pre></div></div>
<p>This command invokes the <code>TraceBuilder</code> utility of <i>Rumen</i>.</p>
<p>TraceBuilder converts the JobHistory files into a series of JSON objects and writes them into the <code>&lt;jobtrace-output&gt;</code> file. It also extracts the cluster layout (topology) and writes it into the <code>&lt;topology-output&gt;</code> file. <code>&lt;inputs&gt;</code> represents a space-separated list of JobHistory files and folders.</p>
<blockquote>
<p>1) Inputs and outputs of <code>TraceBuilder</code> are expected to be fully qualified FileSystem paths. Use <code>file://</code> to specify files on the <code>local</code> FileSystem and <code>hdfs://</code> to specify files on HDFS. Because input files and folders are FileSystem paths, they can be globbed. This is useful for specifying multiple file paths with a single glob pattern.</p>
<p>2) By default, TraceBuilder does not recursively scan the input folder for job history files. Only the files placed directly under the input folder are considered for generating the trace. To include all the files under the input directory by scanning it recursively, use the <code>-recursive</code> option.</p>
</blockquote>
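<p>The job trace described above is a series of JSON objects rather than a single JSON document, so reading the whole file with one <code>json.load</code> call may not work. As a minimal sketch, assuming the objects are simply written one after another, the following Python reads them one at a time. The helper name <code>iter_json_objects</code> and the sample field names <code>jobID</code> and <code>submitTime</code> are illustrative assumptions, not a description of the exact trace schema.</p>
<div class="source">
<div class="source">
<pre>import json


def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while True:
        # raw_decode does not accept leading whitespace, so skip it by hand.
        while pos != end and text[pos].isspace():
            pos += 1
        if pos == end:
            return
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Hypothetical sample; the field names are assumptions for illustration only.
sample = '{"jobID": "job_1", "submitTime": 100}\n{"jobID": "job_2", "submitTime": 250}\n'
jobs = list(iter_json_objects(sample))
print(len(jobs), jobs[0]["jobID"])
</pre></div></div>
<p>For a real trace, the text would come from the local copy of the <code>&lt;jobtrace-output&gt;</code> file. A very large trace may call for a streaming parser instead of reading the whole file into memory.</p>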
<p>Cluster topology is used as follows :</p>
<ul>
<li>
<p>To reconstruct the splits and make sure that the distances/latencies seen in the actual run are modeled correctly.</p>
</li>
<li>
<p>To extrapolate splits information for tasks with missing splits details or synthetically generated tasks.</p>
</li>
</ul></section><section>
<h4><a name="Options"></a>Options</h4>
<table border="0" class="bodyTable">
<tr class="a">
<th> Parameter</th>
<th> Description</th>
<th> Notes </th>
</tr>
<tr class="b">
<td><code>-demuxer</code></td>
<td>Used to read the jobhistory files. The default is
<code>DefaultInputDemuxer</code>.</td>
<td>The demuxer decides how the input file maps to jobhistory file(s).
Job history logs and job configuration files are typically small
files, and can be stored more effectively when embedded in a
container file format such as SequenceFile or TFile. To support such
use cases, one can specify a customized Demuxer class that can
extract individual job history logs and job configuration files
from the source files.
</td>
</tr>
<tr class="a">
<td><code>-recursive</code></td>
<td>Recursively traverse input paths for job history logs.</td>
<td>This option tells the TraceBuilder to
recursively scan the input paths and process all the files under them.
Note that, by default, only the history logs that are directly under
the input folder are considered for generating the trace.
</td>
</tr>
</table>
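<p>The difference between the default scan and <code>-recursive</code> can be illustrated on a local directory. This is only a sketch of the behaviour described in the table above, not TraceBuilder's implementation. The function names and the <code>job_1.jhist</code> and <code>job_2.jhist</code> file names are hypothetical.</p>
<div class="source">
<div class="source">
<pre>import tempfile
from pathlib import Path


def direct_files(folder):
    """Files placed directly under folder (the default scan described above)."""
    return sorted(p.name for p in Path(folder).iterdir() if p.is_file())


def recursive_files(folder):
    """Files at any depth under folder (what -recursive is described as adding)."""
    return sorted(p.name for p in Path(folder).rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: one history file at the top level, one in a subfolder.
    Path(root, "job_1.jhist").write_text("")
    Path(root, "2023").mkdir()
    Path(root, "2023", "job_2.jhist").write_text("")
    top_only = direct_files(root)
    everything = recursive_files(root)

print(top_only)    # ['job_1.jhist']
print(everything)  # ['job_1.jhist', 'job_2.jhist']
</pre></div></div>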
</section><section>
<h4><a name="Example"></a>Example</h4>
<div class="source">
<div class="source">
<pre>hadoop rumentrace \
file:///tmp/job-trace.json \
file:///tmp/job-topology.json \
hdfs:///tmp/hadoop-yarn/staging/history/done_intermediate/testuser
</pre></div></div>
<p>This will analyze all the jobs in <code>/tmp/hadoop-yarn/staging/history/done_intermediate/testuser</code> stored on the <code>HDFS</code> FileSystem and output the jobtraces in <code>/tmp/job-trace.json</code> along with topology information in <code>/tmp/job-topology.json</code> stored on the <code>local</code> FileSystem.</p></section></section><section>
<h3><a name="Folder"></a>Folder</h3><section>
<h4><a name="Command"></a>Command</h4>
<div class="source">
<div class="source">
<pre>hadoop rumenfolder [options] [input] [output]
</pre></div></div>
<p>This command invokes the <code>Folder</code> utility of <i>Rumen</i>. Folding essentially means that the output duration of the resulting trace is fixed and job timelines are adjusted to respect the final output duration.</p>
<blockquote>
<p>Inputs and outputs of <code>Folder</code> are expected to be fully qualified FileSystem paths. Use <code>file://</code> to specify files on the <code>local</code> FileSystem and <code>hdfs://</code> to specify files on HDFS.</p>
</blockquote></section><section>
<h4><a name="Options"></a>Options</h4>
<table border="0" class="bodyTable">
<tr class="a">
<th> Parameter</th>
<th> Description</th>
<th> Notes </th>
</tr>
<tr class="b">
<td><code>-input-cycle</code></td>
<td>Defines the basic unit of time for the folding operation. There is
no default value for <code>input-cycle</code>.
<b>Input cycle must be provided</b>.
</td>
<td>'<code>-input-cycle 10m</code>'
implies that the whole trace run will now be sliced at 10-minute
intervals. Basic operations will be done on the 10m chunks. Note
that <i>Rumen</i> understands various time units such as
<i>m (minutes), h (hours), d (days), etc</i>.
</td>
</tr>
<tr class="a">
<td><code>-output-duration</code></td>
<td>This parameter defines the final runtime of the trace.
The default value is <b>1 hour</b>.
</td>
<td>'<code>-output-duration 30m</code>'
implies that the resulting trace will have a maximum runtime of
30 minutes. All the jobs in the input trace file will be folded and
scaled to fit this window.
</td>
</tr>
<tr class="b">
<td><code>-concentration</code></td>
<td>Set the concentration of the resulting trace. Default value is
<b>1</b>.
</td>
<td>If the total runtime of the resulting trace is less than the total
runtime of the input trace, then the resulting trace contains
fewer jobs than the input trace. This
essentially means that the output is diluted. To increase the
density of jobs, set the concentration to a higher value.</td>
</tr>
<tr class="a">
<td><code>-debug</code></td>
<td>Run the Folder in debug mode. By default it is set to
<b>false</b>.</td>
<td>In debug mode, the Folder will print additional statements for
debugging. Also the intermediate files generated in the scratch
directory will not be cleaned up.
</td>
</tr>
<tr class="b">
<td><code>-seed</code></td>
<td>Initial seed to the Random Number Generator. By default, a Random
Number Generator is used to generate a seed and the seed value is
reported back to the user for future use.
</td>
<td>If an initial seed is passed, then the <code>Random Number
Generator</code> will generate the random numbers in the same
sequence, i.e., the sequence of random numbers remains the same if the
same seed is used. Folder uses the Random Number Generator to decide
whether or not to emit a job.
</td>
</tr>
<tr class="a">
<td><code>-temp-directory</code></td>
<td>Temporary directory for the Folder. By default the <b>output
folder's parent directory</b> is used as the scratch space.
</td>
<td>This is the scratch space used by Folder. All the
temporary files are cleaned up in the end unless the Folder is run
in <code>debug</code> mode.</td>
</tr>
<tr class="b">
<td><code>-skew-buffer-length</code></td>
<td>Enables <i>Folder</i> to tolerate skewed jobs.
The default buffer length is <b>0</b>.</td>
<td>'<code>-skew-buffer-length 100</code>'
indicates that if the jobs appear out of order within a window
size of 100, then they will be emitted in-order by the folder.
If a job appears out-of-order outside this window, then the Folder
will bail out provided <code>-allow-missorting</code> is not set.
<i>Folder</i> reports the maximum skew size seen in the
input trace for future use.
</td>
</tr>
<tr class="a">
<td><code>-allow-missorting</code></td>
<td>Enables <i>Folder</i> to tolerate out-of-order jobs. By default
mis-sorting is not allowed.
</td>
<td>If mis-sorting is allowed, then the <i>Folder</i> will ignore
out-of-order jobs that cannot be deskewed using a skew buffer of
size specified using <code>-skew-buffer-length</code>. If
mis-sorting is not allowed, then the Folder will bail out if the
skew buffer is incapable of tolerating the skew.
</td>
</tr>
</table>
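<p>The interplay of <code>-output-duration</code>, <code>-concentration</code> and <code>-seed</code> can be sketched in a few lines of Python. The arithmetic follows the figures this document gives for a 10-hour trace folded to 1 hour (10% of jobs retained by default, 20% with a concentration of 2). The function names are hypothetical, the duration parser covers only the <code>m</code>, <code>h</code> and <code>d</code> units named above, and the random selection is an assumption used for illustration, not Folder's actual selection logic.</p>
<div class="source">
<div class="source">
<pre>import random

# Seconds per unit, covering only the units named in the table above.
UNIT_SECONDS = {"m": 60, "h": 3600, "d": 86400}


def parse_duration(text):
    """Convert a value such as '20m' or '1h' into seconds."""
    number, unit = text[:-1], text[-1]
    return int(number) * UNIT_SECONDS[unit]


def retained_fraction(input_runtime, output_duration, concentration=1):
    """Approximate share of input jobs kept when a trace is shrunk."""
    ratio = parse_duration(output_duration) / parse_duration(input_runtime)
    return concentration * ratio


def pick_jobs(job_ids, keep_probability, seed):
    """Keep each job with the given probability, using a seeded generator."""
    rng = random.Random(seed)
    return [job for job in job_ids if keep_probability > rng.random()]


share = retained_fraction("10h", "1h", concentration=2)
jobs = ["job_%d" % i for i in range(20)]
first = pick_jobs(jobs, share, seed=42)
second = pick_jobs(jobs, share, seed=42)
print(share)            # 0.2
print(first == second)  # True: the same seed gives the same selection
</pre></div></div>
<p>Reusing the seed value that Folder reports is what makes a fold repeatable; with a different seed the set of emitted jobs would generally differ.</p>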
</section><section>
<h4><a name="Examples"></a>Examples</h4><section>
<h5><a name="Folding_an_input_trace_with_10_hours_of_total_runtime_to_generate_an_output_trace_with_1_hour_of_total_runtime"></a>Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime</h5>
<div class="source">
<div class="source">
<pre>hadoop rumenfolder \
-output-duration 1h \
-input-cycle 20m \
file:///tmp/job-trace.json \
file:///tmp/job-trace-1hr.json
</pre></div></div>
<p>If the folded jobs are out of order then the command will bail out.</p></section><section>
<h5><a name="Folding_an_input_trace_with_10_hours_of_total_runtime_to_generate_an_output_trace_with_1_hour_of_total_runtime_and_tolerate_some_skewness"></a>Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime and tolerate some skewness</h5>
<div class="source">
<div class="source">
<pre>hadoop rumenfolder \
-output-duration 1h \
-input-cycle 20m \
-allow-missorting \
-skew-buffer-length 100 \
file:///tmp/job-trace.json \
file:///tmp/job-trace-1hr.json
</pre></div></div>
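<p>If the folded jobs are out of order, then at most 100 jobs will be de-skewed. If the 101<sup>st</sup> job is <i>out-of-order</i>, it falls outside the skew buffer; because <code>-allow-missorting</code> is set here, the option descriptions above indicate that <i>Folder</i> ignores such a job, whereas without <code>-allow-missorting</code> the command would bail out.</p></section><section>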
<h5><a name="Folding_an_input_trace_with_10_hours_of_total_runtime_to_generate_an_output_trace_with_1_hour_of_total_runtime_in_debug_mode"></a>Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime in debug mode</h5>
<div class="source">
<div class="source">
<pre>hadoop rumenfolder \
-output-duration 1h \
-input-cycle 20m \
-debug -temp-directory file:///tmp/debug \
file:///tmp/job-trace.json \
file:///tmp/job-trace-1hr.json
</pre></div></div>
<p>This will fold the 10hr job-trace file <code>file:///tmp/job-trace.json</code> to finish within 1hr and use <code>file:///tmp/debug</code> as the temporary directory. The intermediate files in the temporary directory will not be cleaned up.</p></section><section>
<h5><a name="Folding_an_input_trace_with_10_hours_of_total_runtime_to_generate_an_output_trace_with_1_hour_of_total_runtime_with_custom_concentration."></a>Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime with custom concentration.</h5>
<div class="source">
<div class="source">
<pre>hadoop rumenfolder \
-output-duration 1h \
-input-cycle 20m \
-concentration 2 \
file:///tmp/job-trace.json \
file:///tmp/job-trace-1hr.json
</pre></div></div>
<p>This will fold the 10hr job-trace file <code>file:///tmp/job-trace.json</code> to finish within 1hr with a concentration of 2. If the 10h job-trace is folded to 1h, it retains 10% of the jobs by default. With a <i>concentration</i> of 2, 20% of the total input jobs will be retained.</p></section></section></section></section><section>
<h2><a name="Appendix"></a>Appendix</h2><section>
<h3><a name="Resources"></a>Resources</h3>
<p><a class="externalLink" href="https://issues.apache.org/jira/browse/MAPREDUCE-751">MAPREDUCE-751</a> is the main JIRA that introduced <i>Rumen</i> to <i>MapReduce</i>. Look at the MapReduce <a class="externalLink" href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313617">rumen-component</a> for further details.</p></section></section>
</div>
</div>
<div class="clear">
<hr/>
</div>
<div id="footer">
<div class="xright">
&#169; 2008-2023
Apache Software Foundation
- <a href="http://maven.apache.org/privacy-policy.html">Privacy Policy</a>.
Apache Maven, Maven, Apache, the Apache feather logo, and the Apache Maven project logos are trademarks of The Apache Software Foundation.
</div>
<div class="clear">
<hr/>
</div>
</div>
</body>
</html>