HADOOP-10954. Adding site documents of hadoop-tools (Masatake Iwasaki via aw)
parent fe8222998f
commit 27f9edceed

@@ -212,6 +212,9 @@ Release 2.6.0 - UNRELEASED

     HADOOP-8808. Update FsShell documentation to mention deprecation of some
     of the commands, and mention alternatives (Akira AJISAKA via aw)

+    HADOOP-10954. Adding site documents of hadoop-tools (Masatake Iwasaki
+    via aw)
+
   OPTIMIZATIONS

     HADOOP-10838. Byte array native checksumming. (James Thomas via todd)
@@ -0,0 +1,818 @@
<!---
|
||||||
|
Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
you may not use this file except in compliance with the License.
|
||||||
|
You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software
|
||||||
|
distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
See the License for the specific language governing permissions and
|
||||||
|
limitations under the License. See accompanying LICENSE file.
|
||||||
|
-->
|
||||||
|
|
||||||
|
Gridmix
|
||||||
|
=======
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
- [Overview](#Overview)
|
||||||
|
- [Usage](#Usage)
|
||||||
|
- [General Configuration Parameters](#General_Configuration_Parameters)
|
||||||
|
- [Job Types](#Job_Types)
|
||||||
|
- [Job Submission Policies](#Job_Submission_Policies)
|
||||||
|
- [Emulating Users and Queues](#Emulating_Users_and_Queues)
|
||||||
|
- [Emulating Distributed Cache Load](#Emulating_Distributed_Cache_Load)
|
||||||
|
- [Configuration of Simulated Jobs](#Configuration_of_Simulated_Jobs)
|
||||||
|
- [Emulating Compression/Decompression](#Emulating_CompressionDecompression)
|
||||||
|
- [Emulating High-Ram jobs](#Emulating_High-Ram_jobs)
|
||||||
|
- [Emulating resource usages](#Emulating_resource_usages)
|
||||||
|
- [Simplifying Assumptions](#Simplifying_Assumptions)
|
||||||
|
- [Appendix](#Appendix)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
Overview
|
||||||
|
--------
|
||||||
|
|
||||||
|
GridMix is a benchmark for Hadoop clusters. It submits a mix of
|
||||||
|
synthetic jobs, modeling a profile mined from production loads.
|
||||||
|
|
||||||
|
There exist three versions of the GridMix tool. This document
|
||||||
|
discusses the third (checked into `src/contrib` ), distinct
|
||||||
|
from the two checked into the `src/benchmarks` sub-directory.
|
||||||
|
While the first two versions of the tool included stripped-down versions
|
||||||
|
of common jobs, both were principally saturation tools for stressing the
|
||||||
|
framework at scale. In support of a broader range of deployments and
|
||||||
|
finer-tuned job mixes, this version of the tool will attempt to model
|
||||||
|
the resource profiles of production jobs to identify bottlenecks, guide
|
||||||
|
development, and serve as a replacement for the existing GridMix
|
||||||
|
benchmarks.
|
||||||
|
|
||||||
|
To run GridMix, you need a MapReduce job trace describing the job mix
|
||||||
|
for a given cluster. Such traces are typically generated by Rumen (see
|
||||||
|
Rumen documentation). GridMix also requires input data from which the
|
||||||
|
synthetic jobs will be reading bytes. The input data need not be in any
|
||||||
|
particular format, as the synthetic jobs are currently binary readers.
|
||||||
|
If you are running on a new cluster, an optional step generating input
|
||||||
|
data may precede the run.
|
||||||
|
In order to emulate the load of production jobs from a given cluster
|
||||||
|
on the same or another cluster, follow these steps:
|
||||||
|
|
||||||
|
1. Locate the job history files on the production cluster. This
|
||||||
|
location is specified by the
|
||||||
|
`mapred.job.tracker.history.completed.location`
|
||||||
|
configuration property of the cluster.
|
||||||
|
|
||||||
|
2. Run Rumen to build a job trace in JSON format for all or select jobs.
|
||||||
|
|
||||||
|
3. Use GridMix with the job trace on the benchmark cluster.
|
||||||
|
|
||||||
|
Jobs submitted by GridMix have names of the form
|
||||||
|
"`GRIDMIXnnnnnn`", where
|
||||||
|
"`nnnnnn`" is a sequence number padded with leading zeroes.
|
||||||
|
|
||||||
|
|
||||||
|
Usage
|
||||||
|
-----
|
||||||
|
|
||||||
|
Basic command-line usage without configuration parameters:
|
||||||
|
|
||||||
|
org.apache.hadoop.mapred.gridmix.Gridmix [-generate <size>] [-users <users-list>] <iopath> <trace>
|
||||||
|
|
||||||
|
Basic command-line usage with configuration parameters:
|
||||||
|
|
||||||
|
org.apache.hadoop.mapred.gridmix.Gridmix \
|
||||||
|
-Dgridmix.client.submit.threads=10 -Dgridmix.output.directory=foo \
|
||||||
|
[-generate <size>] [-users <users-list>] <iopath> <trace>
|
||||||
|
|
||||||
|
> Configuration parameters like
|
||||||
|
> `-Dgridmix.client.submit.threads=10` and
|
||||||
|
> `-Dgridmix.output.directory=foo` as given above should
|
||||||
|
> be used *before* other GridMix parameters.
|
||||||
|
|
||||||
|
The `<iopath>` parameter is the working directory for
|
||||||
|
GridMix. Note that this can either be on the local file-system
|
||||||
|
or on HDFS, but it is highly recommended that it be the same as that for
|
||||||
|
the original job mix so that GridMix puts the same load on the local
|
||||||
|
file-system and HDFS respectively.
|
||||||
|
|
||||||
|
The `-generate` option is used to generate input data and
|
||||||
|
Distributed Cache files for the synthetic jobs. It accepts standard units
|
||||||
|
of size suffixes, e.g. `100g` will generate
|
||||||
|
100 * 2<sup>30</sup> bytes as input data.
|
||||||
|
`<iopath>/input` is the destination directory for
|
||||||
|
generated input data and/or the directory from which input data will be
|
||||||
|
read. HDFS-based Distributed Cache files are generated under the
|
||||||
|
distributed cache directory `<iopath>/distributedCache`.
|
||||||
|
If some of the needed Distributed Cache files already exist in the
distributed cache directory, then only the remaining missing
Distributed Cache files are generated when the `-generate` option
is specified.
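For example, a run that first generates 100 GiB of input data (plus any missing
Distributed Cache files) and then submits the trace might look like the sketch
below; the trace file name is an illustrative placeholder:

    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
      -generate 100g <iopath> trace.json.gz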
|
||||||
|
|
||||||
|
The `-users` option is used to point to a users-list
|
||||||
|
file (see <a href="#usersqueues">Emulating Users and Queues</a>).
|
||||||
|
|
||||||
|
The `<trace>` parameter is a path to a job trace
|
||||||
|
generated by Rumen. This trace can be compressed (it must be readable
|
||||||
|
using one of the compression codecs supported by the cluster) or
|
||||||
|
uncompressed. Use "-" as the value of this parameter if you
|
||||||
|
want to pass an *uncompressed* trace via the standard
|
||||||
|
input-stream of GridMix.
|
||||||
|
|
||||||
|
The class `org.apache.hadoop.mapred.gridmix.Gridmix` can
|
||||||
|
be found in the JAR
|
||||||
|
`contrib/gridmix/hadoop-gridmix-$VERSION.jar` inside your
|
||||||
|
Hadoop installation, where `$VERSION` corresponds to the
|
||||||
|
version of Hadoop installed. A simple way of ensuring that this class
|
||||||
|
and all its dependencies are loaded correctly is to use the
|
||||||
|
`hadoop` wrapper script in Hadoop:
|
||||||
|
|
||||||
|
hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
|
||||||
|
[-generate <size>] [-users <users-list>] <iopath> <trace>
|
||||||
|
|
||||||
|
The supported configuration parameters are explained in the
|
||||||
|
following sections.
|
||||||
|
|
||||||
|
|
||||||
|
General Configuration Parameters
|
||||||
|
--------------------------------
|
||||||
|
|
||||||
|
<table>
|
||||||
|
<tr>
|
||||||
|
<th>Parameter</th>
|
||||||
|
<th>Description</th>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.output.directory</code>
|
||||||
|
</td>
|
||||||
|
<td>The directory into which output will be written. If specified,
|
||||||
|
<code>iopath</code> will be relative to this parameter. The
|
||||||
|
submitting user must have read/write access to this directory. The
|
||||||
|
user should also be mindful of any quota issues that may arise
|
||||||
|
during a run. The default is "<code>gridmix</code>".</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.client.submit.threads</code>
|
||||||
|
</td>
|
||||||
|
<td>The number of threads submitting jobs to the cluster. This
|
||||||
|
also controls how many splits will be loaded into memory at a given
|
||||||
|
time, pending the submit time in the trace. Splits are pre-generated
|
||||||
|
to hit submission deadlines, so particularly dense traces may want
|
||||||
|
more submitting threads. However, storing splits in memory is
|
||||||
|
reasonably expensive, so you should raise this cautiously. The
|
||||||
|
default is 1 for the SERIAL job-submission policy (see
|
||||||
|
<a href="#policies">Job Submission Policies</a>) and one more than
|
||||||
|
the number of processors on the client machine for the other
|
||||||
|
policies.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.submit.multiplier</code>
|
||||||
|
</td>
|
||||||
|
<td>The multiplier to accelerate or decelerate the submission of
|
||||||
|
jobs. The time separating two jobs is multiplied by this factor.
|
||||||
|
The default value is 1.0. This is a crude mechanism to size
|
||||||
|
a job trace to a cluster.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.client.pending.queue.depth</code>
|
||||||
|
</td>
|
||||||
|
<td>The depth of the queue of job descriptions awaiting split
|
||||||
|
generation. The jobs read from the trace occupy a queue of this
|
||||||
|
depth before being processed by the submission threads. It is
|
||||||
|
unusual to configure this. The default is 5.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.gen.blocksize</code>
|
||||||
|
</td>
|
||||||
|
<td>The block-size of generated data. The default value is 256
|
||||||
|
MiB.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.gen.bytes.per.file</code>
|
||||||
|
</td>
|
||||||
|
<td>The maximum bytes written per file. The default value is 1
|
||||||
|
GiB.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.min.file.size</code>
|
||||||
|
</td>
|
||||||
|
<td>The minimum size of the input files. The default limit is 128
|
||||||
|
MiB. Tweak this parameter if you see an error-message like
|
||||||
|
"Found no satisfactory file" while testing GridMix with
|
||||||
|
a relatively-small input data-set.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.max.total.scan</code>
|
||||||
|
</td>
|
||||||
|
<td>The maximum size of the input files. The default limit is 100
|
||||||
|
TiB.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.task.jvm-options.enable</code>
|
||||||
|
</td>
|
||||||
|
<td>Enables Gridmix to configure the simulated task's max heap
|
||||||
|
options using the values obtained from the original task (i.e. via the
|
||||||
|
trace).
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
</table>
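As a hedged illustration, any of the parameters above can be overridden on the
command line with `-D`, placed before the GridMix-specific options; the values
below are arbitrary examples, not recommendations:

    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
      -Dgridmix.client.submit.threads=4 \
      -Dgridmix.submit.multiplier=0.5 \
      [-generate <size>] [-users <users-list>] <iopath> <trace>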
|
||||||
|
|
||||||
|
|
||||||
|
Job Types
|
||||||
|
---------
|
||||||
|
|
||||||
|
GridMix takes as input a job trace, essentially a stream of
|
||||||
|
JSON-encoded job descriptions. For each job description, the submission
|
||||||
|
client obtains the original job submission time and for each task in
|
||||||
|
that job, the byte and record counts read and written. Given this data,
|
||||||
|
it constructs a synthetic job with the same byte and record patterns as
|
||||||
|
recorded in the trace. It constructs jobs of two types:
|
||||||
|
|
||||||
|
<table>
|
||||||
|
<tr>
|
||||||
|
<th>Job Type</th>
|
||||||
|
<th>Description</th>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>LOADJOB</code>
|
||||||
|
</td>
|
||||||
|
<td>A synthetic job that emulates the workload mentioned in Rumen
|
||||||
|
trace. In the current version we are supporting I/O. It reproduces
|
||||||
|
the I/O workload on the benchmark cluster. It does so by embedding
|
||||||
|
the detailed I/O information for every map and reduce task, such as
|
||||||
|
the number of bytes and records read and written, into each
|
||||||
|
job's input splits. The map tasks further relay the I/O patterns of
|
||||||
|
reduce tasks through the intermediate map output data.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>SLEEPJOB</code>
|
||||||
|
</td>
|
||||||
|
<td>A synthetic job where each task does *nothing* but sleep
|
||||||
|
for a certain duration as observed in the production trace. The
|
||||||
|
scalability of the Job Tracker is often limited by how many
|
||||||
|
heartbeats it can handle every second. (Heartbeats are periodic
|
||||||
|
messages sent from Task Trackers to update their status and grab new
|
||||||
|
tasks from the Job Tracker.) Since a benchmark cluster is typically
|
||||||
|
a fraction in size of a production cluster, the heartbeat traffic
|
||||||
|
generated by the slave nodes is well below the level of the
|
||||||
|
production cluster. One possible solution is to run multiple Task
|
||||||
|
Trackers on each slave node. This leads to the obvious problem that
|
||||||
|
the I/O workload generated by the synthetic jobs would thrash the
|
||||||
|
slave nodes. Hence the need for such a job.</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
|
||||||
|
The following configuration parameters affect the job type:
|
||||||
|
|
||||||
|
<table>
|
||||||
|
<tr>
|
||||||
|
<th>Parameter</th>
|
||||||
|
<th>Description</th>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.job.type</code>
|
||||||
|
</td>
|
||||||
|
<td>The value for this key can be one of LOADJOB or SLEEPJOB. The
|
||||||
|
default value is LOADJOB.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.key.fraction</code>
|
||||||
|
</td>
|
||||||
|
<td>For a LOADJOB type of job, the fraction of a record used for
|
||||||
|
the data for the key. The default value is 0.1.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.sleep.maptask-only</code>
|
||||||
|
</td>
|
||||||
|
<td>For a SLEEPJOB type of job, whether to ignore the reduce
|
||||||
|
tasks for the job. The default is <code>false</code>.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.sleep.fake-locations</code>
|
||||||
|
</td>
|
||||||
|
<td>For a SLEEPJOB type of job, the number of fake locations
|
||||||
|
for map tasks for the job. The default is 0.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.sleep.max-map-time</code>
|
||||||
|
</td>
|
||||||
|
<td>For a SLEEPJOB type of job, the maximum runtime for map
|
||||||
|
tasks for the job in milliseconds. The default is unlimited.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.sleep.max-reduce-time</code>
|
||||||
|
</td>
|
||||||
|
<td>For a SLEEPJOB type of job, the maximum runtime for reduce
|
||||||
|
tasks for the job in milliseconds. The default is unlimited.</td>
|
||||||
|
</tr>
|
||||||
|
</table>
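For instance, a sleep-job run that ignores reduce tasks and caps map runtimes
could be configured as in the sketch below (the 60000 ms cap is an arbitrary
example value):

    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
      -Dgridmix.job.type=SLEEPJOB \
      -Dgridmix.sleep.maptask-only=true \
      -Dgridmix.sleep.max-map-time=60000 \
      <iopath> <trace>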
|
||||||
|
|
||||||
|
|
||||||
|
<a name="policies"></a>
|
||||||
|
|
||||||
|
Job Submission Policies
|
||||||
|
-----------------------
|
||||||
|
|
||||||
|
GridMix controls the rate of job submission. This control can be
|
||||||
|
based on the trace information or can be based on statistics it gathers
|
||||||
|
from the Job Tracker. Based on the submission policies users define,
|
||||||
|
GridMix uses the respective algorithm to control the job submission.
|
||||||
|
There are currently three types of policies:
|
||||||
|
|
||||||
|
<table>
|
||||||
|
<tr>
|
||||||
|
<th>Job Submission Policy</th>
|
||||||
|
<th>Description</th>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>STRESS</code>
|
||||||
|
</td>
|
||||||
|
<td>Keep submitting jobs so that the cluster remains under stress.
|
||||||
|
In this mode we control the rate of job submission by monitoring
|
||||||
|
the real-time load of the cluster so that we can maintain a stable
|
||||||
|
stress level of workload on the cluster. Based on the statistics we
|
||||||
|
gather we define if a cluster is *underloaded* or
|
||||||
|
*overloaded* . We consider a cluster *underloaded* if
|
||||||
|
and only if the following three conditions are true:
|
||||||
|
<ol>
|
||||||
|
<li>the number of pending and running jobs are under a threshold
|
||||||
|
TJ</li>
|
||||||
|
<li>the number of pending and running maps are under threshold
|
||||||
|
TM</li>
|
||||||
|
<li>the number of pending and running reduces are under threshold
|
||||||
|
TR</li>
|
||||||
|
</ol>
|
||||||
|
The thresholds TJ, TM and TR are proportional to the size of the
|
||||||
|
cluster and map, reduce slots capacities respectively. In case of a
|
||||||
|
cluster being *overloaded* , we throttle the job submission.
|
||||||
|
In the actual calculation we also weigh each running task with its
|
||||||
|
remaining work - namely, a 90% complete task is only counted as 0.1
|
||||||
|
in calculation. Finally, to avoid a very large job blocking other
|
||||||
|
jobs, we limit the number of pending/waiting tasks each job can
|
||||||
|
contribute.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>REPLAY</code>
|
||||||
|
</td>
|
||||||
|
<td>In this mode we replay the job traces faithfully. This mode
|
||||||
|
exactly follows the time-intervals given in the actual job
|
||||||
|
trace.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>SERIAL</code>
|
||||||
|
</td>
|
||||||
|
<td>In this mode we submit the next job only once the job submitted
|
||||||
|
earlier is completed.</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
|
||||||
|
The following configuration parameters affect the job submission policy:
|
||||||
|
|
||||||
|
<table>
|
||||||
|
<tr>
|
||||||
|
<th>Parameter</th>
|
||||||
|
<th>Description</th>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.job-submission.policy</code>
|
||||||
|
</td>
|
||||||
|
<td>The value for this key would be one of the three: STRESS, REPLAY
|
||||||
|
or SERIAL. In most of the cases the value of key would be STRESS or
|
||||||
|
REPLAY. The default value is STRESS.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.throttle.jobs-to-tracker-ratio</code>
|
||||||
|
</td>
|
||||||
|
<td>In STRESS mode, the minimum ratio of running jobs to Task
|
||||||
|
Trackers in a cluster for the cluster to be considered
|
||||||
|
*overloaded* . This is the threshold TJ referred to earlier.
|
||||||
|
The default is 1.0.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.throttle.maps.task-to-slot-ratio</code>
|
||||||
|
</td>
|
||||||
|
<td>In STRESS mode, the minimum ratio of pending and running map
|
||||||
|
tasks (i.e. incomplete map tasks) to the number of map slots for
|
||||||
|
a cluster for the cluster to be considered *overloaded* .
|
||||||
|
This is the threshold TM referred to earlier. Running map tasks are
|
||||||
|
counted partially. For example, a 40% complete map task is counted
|
||||||
|
as 0.6 map tasks. The default is 2.0.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.throttle.reduces.task-to-slot-ratio</code>
|
||||||
|
</td>
|
||||||
|
<td>In STRESS mode, the minimum ratio of pending and running reduce
|
||||||
|
tasks (i.e. incomplete reduce tasks) to the number of reduce slots
|
||||||
|
for a cluster for the cluster to be considered *overloaded* .
|
||||||
|
This is the threshold TR referred to earlier. Running reduce tasks
|
||||||
|
are counted partially. For example, a 30% complete reduce task is
|
||||||
|
counted as 0.7 reduce tasks. The default is 2.5.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.throttle.maps.max-slot-share-per-job</code>
|
||||||
|
</td>
|
||||||
|
<td>In STRESS mode, the maximum share of a cluster's map-slots
|
||||||
|
capacity that can be counted toward a job's incomplete map tasks in
|
||||||
|
overload calculation. The default is 0.1.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.throttle.reducess.max-slot-share-per-job</code>
|
||||||
|
</td>
|
||||||
|
<td>In STRESS mode, the maximum share of a cluster's reduce-slots
|
||||||
|
capacity that can be counted toward a job's incomplete reduce tasks
|
||||||
|
in overload calculation. The default is 0.1.</td>
|
||||||
|
</tr>
|
||||||
|
</table>
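As an example, the sketch below replays the trace while halving the original
inter-job gaps (both property values are illustrative):

    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
      -Dgridmix.job-submission.policy=REPLAY \
      -Dgridmix.submit.multiplier=0.5 \
      <iopath> <trace>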
|
||||||
|
|
||||||
|
|
||||||
|
<a name="usersqueues"></a>
|
||||||
|
|
||||||
|
Emulating Users and Queues
|
||||||
|
--------------------------
|
||||||
|
|
||||||
|
Typical production clusters are often shared with different users and
|
||||||
|
the cluster capacity is divided among different departments through job
|
||||||
|
queues. Ensuring fairness among jobs from all users, honoring queue
|
||||||
|
capacity allocation policies and avoiding an ill-behaving job from
|
||||||
|
taking over the cluster adds significant complexity in Hadoop software.
|
||||||
|
To be able to sufficiently test and discover bugs in these areas,
|
||||||
|
GridMix must emulate the contentions of jobs from different users and/or
|
||||||
|
submitted to different queues.
|
||||||
|
|
||||||
|
Emulating multiple queues is easy - we simply set up the benchmark
|
||||||
|
cluster with the same queue configuration as the production cluster and
|
||||||
|
we configure synthetic jobs so that they get submitted to the same queue
|
||||||
|
as recorded in the trace. However, not all users shown in the trace have
|
||||||
|
accounts on the benchmark cluster. Instead, we set up a number of testing
|
||||||
|
user accounts and associate each unique user in the trace to testing
|
||||||
|
users in a round-robin fashion.
|
||||||
|
|
||||||
|
The following configuration parameters affect the emulation of users
|
||||||
|
and queues:
|
||||||
|
|
||||||
|
<table>
|
||||||
|
<tr>
|
||||||
|
<th>Parameter</th>
|
||||||
|
<th>Description</th>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.job-submission.use-queue-in-trace</code>
|
||||||
|
</td>
|
||||||
|
<td>When set to <code>true</code> it uses exactly the same set of
|
||||||
|
queues as those mentioned in the trace. The default value is
|
||||||
|
<code>false</code>.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.job-submission.default-queue</code>
|
||||||
|
</td>
|
||||||
|
<td>Specifies the default queue to which all the jobs would be
|
||||||
|
submitted. If this parameter is not specified, GridMix uses the
|
||||||
|
default queue defined for the submitting user on the cluster.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.user.resolve.class</code>
|
||||||
|
</td>
|
||||||
|
<td>Specifies which <code>UserResolver</code> implementation to use.
|
||||||
|
We currently have three implementations:
|
||||||
|
<ol>
|
||||||
|
<li><code>org.apache.hadoop.mapred.gridmix.EchoUserResolver</code>
|
||||||
|
- submits a job as the user who submitted the original job. All
|
||||||
|
the users of the production cluster identified in the job trace
|
||||||
|
must also have accounts on the benchmark cluster in this case.</li>
|
||||||
|
<li><code>org.apache.hadoop.mapred.gridmix.SubmitterUserResolver</code>
|
||||||
|
- submits all the jobs as current GridMix user. In this case we
|
||||||
|
simply map all the users in the trace to the current GridMix user
|
||||||
|
and submit the job.</li>
|
||||||
|
<li><code>org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver</code>
|
||||||
|
- maps trace users to test users in a round-robin fashion. In
|
||||||
|
this case we set up a number of testing user accounts and
|
||||||
|
associate each unique user in the trace to testing users in a
|
||||||
|
round-robin fashion.</li>
|
||||||
|
</ol>
|
||||||
|
The default is
|
||||||
|
<code>org.apache.hadoop.mapred.gridmix.SubmitterUserResolver</code>.</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
|
||||||
|
If the parameter `gridmix.user.resolve.class` is set to
|
||||||
|
`org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver`,
|
||||||
|
we need to define a users-list file with a list of test users.
|
||||||
|
This is specified using the `-users` option to GridMix.
|
||||||
|
|
||||||
|
> Specifying a users-list file using the `-users` option is
> mandatory when using the round-robin user-resolver. Other user-resolvers
> ignore this option.
|
||||||
|
|
||||||
|
A users-list file has one user per line, each line of the format:
|
||||||
|
|
||||||
|
<username>
|
||||||
|
|
||||||
|
For example:
|
||||||
|
|
||||||
|
user1
|
||||||
|
user2
|
||||||
|
user3
|
||||||
|
|
||||||
|
In the above example we have defined three users `user1`, `user2` and `user3`.
|
||||||
|
Now we would associate each unique user in the trace to the above users
|
||||||
|
defined in round-robin fashion. For example, if trace's users are
|
||||||
|
`tuser1`, `tuser2`, `tuser3`, `tuser4` and `tuser5`, then the mappings would be:
|
||||||
|
|
||||||
|
tuser1 -> user1
|
||||||
|
tuser2 -> user2
|
||||||
|
tuser3 -> user3
|
||||||
|
tuser4 -> user1
|
||||||
|
tuser5 -> user2
|
||||||
|
|
||||||
|
For backward compatibility reasons, each line of users-list file can
|
||||||
|
contain username followed by groupnames in the form username[,group]*.
|
||||||
|
The groupnames will be ignored by Gridmix.
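Putting this together, a run that maps trace users onto test accounts in
round-robin fashion and preserves the traced queues might look like the
following sketch (the users-list path is a placeholder):

    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
      -Dgridmix.user.resolve.class=org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver \
      -Dgridmix.job-submission.use-queue-in-trace=true \
      -users users.txt <iopath> <trace>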
|
||||||
|
|
||||||
|
|
||||||
|
Emulating Distributed Cache Load
|
||||||
|
--------------------------------
|
||||||
|
|
||||||
|
Gridmix emulates Distributed Cache load by default for LOADJOB type of
|
||||||
|
jobs. This is done by precreating the needed Distributed Cache files for all
|
||||||
|
the simulated jobs as part of a separate MapReduce job.
|
||||||
|
|
||||||
|
Emulation of Distributed Cache load in gridmix simulated jobs can be
|
||||||
|
disabled by configuring the property
|
||||||
|
`gridmix.distributed-cache-emulation.enable` to
|
||||||
|
`false`.
|
||||||
|
Generation of Distributed Cache data by GridMix is, however, driven by the
`-generate` option and is independent of this configuration
property.
|
||||||
|
|
||||||
|
Both generation of Distributed Cache files and emulation of
|
||||||
|
Distributed Cache load are disabled if:
|
||||||
|
|
||||||
|
* input trace comes from the standard input-stream instead of file, or
|
||||||
|
* `<iopath>` specified is on local file-system, or
|
||||||
|
* any of the ancestor directories of the distributed cache directory,
  i.e. `<iopath>/distributedCache` (including the distributed
  cache directory itself), does not have execute permission for others.
|
||||||
|
|
||||||
|
|
||||||
|
Configuration of Simulated Jobs
|
||||||
|
-------------------------------
|
||||||
|
|
||||||
|
Gridmix3 sets some configuration properties in the simulated Jobs
|
||||||
|
submitted by it so that they can be mapped back to the corresponding Job
|
||||||
|
in the input Job trace. These configuration parameters include:
|
||||||
|
|
||||||
|
<table>
|
||||||
|
<tr>
|
||||||
|
<th>Parameter</th>
|
||||||
|
<th>Description</th>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.job.original-job-id</code>
|
||||||
|
</td>
|
||||||
|
<td> The job id of the original cluster's job corresponding to this
|
||||||
|
simulated job.
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>
|
||||||
|
<code>gridmix.job.original-job-name</code>
|
||||||
|
</td>
|
||||||
|
<td> The job name of the original cluster's job corresponding to this
|
||||||
|
simulated job.
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
|
||||||
|
|
||||||
|
Emulating Compression/Decompression
|
||||||
|
-----------------------------------
|
||||||
|
|
||||||
|
MapReduce supports data compression and decompression.
|
||||||
|
Input to a MapReduce job can be compressed. Similarly, output of Map
|
||||||
|
and Reduce tasks can also be compressed. Compression/Decompression
|
||||||
|
emulation in GridMix is important because emulating
|
||||||
|
compression/decompression will affect the CPU and memory usage of the
|
||||||
|
task. A task emulating compression/decompression will affect other
|
||||||
|
tasks and daemons running on the same node.
|
||||||
|
|
||||||
|
Compression emulation is enabled if
|
||||||
|
`gridmix.compression-emulation.enable` is set to
|
||||||
|
`true`. By default compression emulation is enabled for
|
||||||
|
jobs of type *LOADJOB* . With compression emulation enabled,
|
||||||
|
GridMix will now generate compressed text data with a constant
|
||||||
|
compression ratio. Hence a simulated GridMix job will now emulate
|
||||||
|
compression/decompression using compressible text data (having a
|
||||||
|
constant compression ratio), irrespective of the compression ratio
|
||||||
|
observed in the actual job.
|
||||||
|
|
||||||
|
A typical MapReduce Job deals with data compression/decompression in
|
||||||
|
the following phases
|
||||||
|
|
||||||
|
* `Job input data decompression: ` GridMix generates
|
||||||
|
compressible input data when compression emulation is enabled.
|
||||||
|
Based on the original job's configuration, a simulated GridMix job
|
||||||
|
will use a decompressor to read the compressed input data.
|
||||||
|
Currently, GridMix uses
|
||||||
|
`mapreduce.input.fileinputformat.inputdir` to determine
|
||||||
|
if the original job used compressed input data or
|
||||||
|
not. If the original job's input files are uncompressed then the
|
||||||
|
simulated job will read the compressed input file without using a
|
||||||
|
decompressor.
|
||||||
|
|
||||||
|
* `Intermediate data compression and decompression: `
|
||||||
|
If the original job has map output compression enabled then GridMix
|
||||||
|
too will enable map output compression for the simulated job.
|
||||||
|
Accordingly, the reducers will use a decompressor to read the map
|
||||||
|
output data.
|
||||||
|
|
||||||
|
* `Job output data compression: `
|
||||||
|
If the original job's output is compressed then GridMix
|
||||||
|
too will enable job output compression for the simulated job.
|
||||||
|
|
||||||
|
The following configuration parameters affect compression emulation
|
||||||
|
|
||||||
|
<table>
|
||||||
|
<tr>
|
||||||
|
<th>Parameter</th>
|
||||||
|
<th>Description</th>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>gridmix.compression-emulation.enable</td>
|
||||||
|
<td>Enables compression emulation in simulated GridMix jobs.
|
||||||
|
Default is true.</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
|
||||||
|
With compression emulation turned on, GridMix will generate compressed
|
||||||
|
input data. Hence the total size of the input
|
||||||
|
data will be less than the expected size. Set
|
||||||
|
`gridmix.min.file.size` to a smaller value (roughly 10% of
|
||||||
|
`gridmix.gen.bytes.per.file`) for enabling GridMix to
|
||||||
|
correctly emulate compression.
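For example, assuming `gridmix.min.file.size` is expressed in bytes (an
assumption here), lowering it to roughly 100 MiB for data generated with the
default 1 GiB `gridmix.gen.bytes.per.file` could look like:

    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
      -Dgridmix.min.file.size=104857600 \
      -generate 100g <iopath> <trace>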
|
||||||
|
|
||||||
|
|
||||||
|
Emulating High-Ram jobs
|
||||||
|
-----------------------
|
||||||
|
|
||||||
|
MapReduce allows users to define a job as a High-Ram job. Tasks from a
|
||||||
|
High-Ram job can occupy multiple slots on the task-trackers.
|
||||||
|
Task-tracker assigns fixed virtual memory for each slot. Tasks from
|
||||||
|
High-Ram jobs can occupy multiple slots and thus can use up more
|
||||||
|
virtual memory as compared to a default task.
|
||||||
|
|
||||||
|
Emulating this behavior is important because of the following reasons
|
||||||
|
|
||||||
|
* Impact on scheduler: Scheduling of tasks from High-Ram jobs
|
||||||
|
impacts the scheduling behavior, as it might result in slot
|
||||||
|
reservation and slot/resource utilization.
|
||||||
|
|
||||||
|
* Impact on the node : Since High-Ram tasks occupy multiple slots,
|
||||||
|
trackers do some bookkeeping for allocating extra resources for
|
||||||
|
these tasks. Thus this becomes a precursor for memory emulation
|
||||||
|
where tasks with high memory requirements need to be considered
as High-Ram tasks.
|
||||||
|
|
||||||
|
High-Ram feature emulation can be disabled by setting
|
||||||
|
`gridmix.highram-emulation.enable` to `false`.
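A minimal sketch disabling High-Ram emulation for a run:

    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
      -Dgridmix.highram-emulation.enable=false \
      <iopath> <trace>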
|
||||||
|
|
||||||
|
|
||||||
|
Emulating resource usages
|
||||||
|
-------------------------
|
||||||
|
|
||||||
|
Usages of resources like CPU, physical memory, virtual memory, JVM heap
|
||||||
|
etc. are recorded by MapReduce using its task counters. This information
|
||||||
|
is used by GridMix to emulate the resource usages in the simulated
|
||||||
|
tasks. Emulating resource usages will help GridMix exert similar load
|
||||||
|
on the test cluster as seen in the actual cluster.
|
||||||
|
|
||||||
|
MapReduce tasks use up resources during their entire lifetime. GridMix
|
||||||
|
also tries to mimic this behavior by spanning resource usage emulation
|
||||||
|
across the entire lifetime of the simulated task. Each resource to be
|
||||||
|
emulated should have an *emulator* associated with it.
|
||||||
|
Each such *emulator* should implement the
|
||||||
|
`org.apache.hadoop.mapred.gridmix.emulators.resourceusage
|
||||||
|
.ResourceUsageEmulatorPlugin` interface. Resource
|
||||||
|
*emulators* in GridMix are *plugins* that can be
|
||||||
|
configured (plugged in or out) before every run. GridMix users can
|
||||||
|
configure multiple emulator *plugins* by passing a comma
|
||||||
|
separated list of *emulators* as a value for the
|
||||||
|
`gridmix.emulators.resource-usage.plugins` parameter.
|
||||||
|
|
||||||
|
List of *emulators* shipped with GridMix:
|
||||||
|
|
||||||
|
* Cumulative CPU usage *emulator* :
|
||||||
|
GridMix uses the cumulative CPU usage value published by Rumen
|
||||||
|
and makes sure that the total cumulative CPU usage of the simulated
|
||||||
|
task is close to the value published by Rumen. GridMix can be
|
||||||
|
configured to emulate cumulative CPU usage by adding
|
||||||
|
`org.apache.hadoop.mapred.gridmix.emulators.resourceusage
|
||||||
|
.CumulativeCpuUsageEmulatorPlugin` to the list of emulator
|
||||||
|
*plugins* configured for the
|
||||||
|
`gridmix.emulators.resource-usage.plugins` parameter.
|
||||||
|
CPU usage emulator is designed in such a way that
|
||||||
|
it only emulates at specific progress boundaries of the task. This
|
||||||
|
interval can be configured using
|
||||||
|
`gridmix.emulators.resource-usage.cpu.emulation-interval`.
|
||||||
|
The default value for this parameter is `0.1` i.e
|
||||||
|
`10%`.
|
||||||
|
|
||||||
|
* Total heap usage *emulator* :
|
||||||
|
GridMix uses the total heap usage value published by Rumen
|
||||||
|
and makes sure that the total heap usage of the simulated
|
||||||
|
task is close to the value published by Rumen. GridMix can be
|
||||||
|
configured to emulate total heap usage by adding
|
||||||
|
`org.apache.hadoop.mapred.gridmix.emulators.resourceusage
|
||||||
|
.TotalHeapUsageEmulatorPlugin` to the list of emulator
|
||||||
|
*plugins* configured for the
|
||||||
|
`gridmix.emulators.resource-usage.plugins` parameter.
|
||||||
|
Heap usage emulator is designed in such a way that
|
||||||
|
it only emulates at specific progress boundaries of the task. This
|
||||||
|
interval can be configured using
|
||||||
|
`gridmix.emulators.resource-usage.heap.emulation-interval
|
||||||
|
`. The default value for this parameter is `0.1`
|
||||||
|
i.e `10%` progress interval.
|
||||||
|
|
||||||
|
Note that GridMix will emulate resource usages only for jobs of type *LOADJOB* .
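As a sketch, both shipped emulators can be enabled together; the property takes
a single comma-separated list of plugin class names, kept on one line here:

    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
      -Dgridmix.emulators.resource-usage.plugins=org.apache.hadoop.mapred.gridmix.emulators.resourceusage.CumulativeCpuUsageEmulatorPlugin,org.apache.hadoop.mapred.gridmix.emulators.resourceusage.TotalHeapUsageEmulatorPlugin \
      <iopath> <trace>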
|
||||||
|
|
||||||
|
|
||||||
|
Simplifying Assumptions
|
||||||
|
-----------------------
|
||||||
|
|
||||||
|
GridMix will be developed in stages, incorporating feedback and
|
||||||
|
patches from the community. Currently its intent is to evaluate
|
||||||
|
MapReduce and HDFS performance and not the layers on top of them (i.e.
|
||||||
|
the extensive lib and sub-project space). Given these two limitations,
|
||||||
|
the following characteristics of job load are not currently captured in
|
||||||
|
job traces and cannot be accurately reproduced in GridMix:
|
||||||
|
|
||||||
|
* *Filesystem Properties* - No attempt is made to match block
|
||||||
|
sizes, namespace hierarchies, or any property of input, intermediate
|
||||||
|
or output data other than the bytes/records consumed and emitted from
|
||||||
|
a given task. This implies that some of the most heavily-used parts of
|
||||||
|
the system - text processing, streaming, etc. - cannot be meaningfully tested
|
||||||
|
with the current implementation.
|
||||||
|
|
||||||
|
* *I/O Rates* - The rate at which records are
|
||||||
|
consumed/emitted is assumed to be limited only by the speed of the
|
||||||
|
reader/writer and constant throughout the task.
|
||||||
|
|
||||||
|
* *Memory Profile* - No data on tasks' memory usage over time
|
||||||
|
is available, though the max heap-size is retained.
|
||||||
|
|
||||||
|
* *Skew* - The records consumed and emitted to/from a given
|
||||||
|
task are assumed to follow observed averages, i.e. records will be
|
||||||
|
more regular than may be seen in the wild. Each map also generates
|
||||||
|
a proportional percentage of data for each reduce, so a job with
|
||||||
|
unbalanced input will be flattened.
|
||||||
|
|
||||||
|
* *Job Failure* - User code is assumed to be correct.
|
||||||
|
|
||||||
|
* *Job Independence* - The output or outcome of one job does
|
||||||
|
not affect when or whether a subsequent job will run.
|
||||||
|
|
||||||
|
|
||||||
|
Appendix
|
||||||
|
--------
|
||||||
|
|
||||||
|
Issues tracking the original implementations of
|
||||||
|
<a href="https://issues.apache.org/jira/browse/HADOOP-2369">GridMix1</a>,
|
||||||
|
<a href="https://issues.apache.org/jira/browse/HADOOP-3770">GridMix2</a>,
|
||||||
|
and <a href="https://issues.apache.org/jira/browse/MAPREDUCE-776">GridMix3</a>
|
||||||
|
can be found on the Apache Hadoop MapReduce JIRA. Other issues tracking
|
||||||
|
the current development of GridMix can be found by searching
|
||||||
|
<a href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313086">
|
||||||
|
the Apache Hadoop MapReduce JIRA</a>
|
|
@@ -0,0 +1,397 @@
|
||||||
|
<!--
|
||||||
|
Licensed to the Apache Software Foundation (ASF) under one or more
|
||||||
|
contributor license agreements. See the NOTICE file distributed with
|
||||||
|
this work for additional information regarding copyright ownership.
|
||||||
|
The ASF licenses this file to You under the Apache License, Version 2.0
|
||||||
|
(the "License"); you may not use this file except in compliance with
|
||||||
|
the License. You may obtain a copy of the License at
|
||||||
|
|
||||||
|
http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
|
||||||
|
Unless required by applicable law or agreed to in writing, software
|
||||||
|
distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
See the License for the specific language governing permissions and
|
||||||
|
limitations under the License.
|
||||||
|
-->
|
||||||
|
|
||||||
|
#set ( $H3 = '###' )
|
||||||
|
#set ( $H4 = '####' )
|
||||||
|
#set ( $H5 = '#####' )
|
||||||
|
|
||||||
|
Rumen
|
||||||
|
=====
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
- [Overview](#Overview)
|
||||||
|
- [Motivation](#Motivation)
|
||||||
|
- [Components](#Components)
|
||||||
|
- [How to use Rumen?](#How_to_use_Rumen)
|
||||||
|
- [Trace Builder](#Trace_Builder)
|
||||||
|
- [Example](#Example)
|
||||||
|
- [Folder](#Folder)
|
||||||
|
- [Examples](#Examples)
|
||||||
|
- [Appendix](#Appendix)
|
||||||
|
- [Resources](#Resources)
|
||||||
|
- [Dependencies](#Dependencies)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
Overview
|
||||||
|
--------
|
||||||
|
|
||||||
|
*Rumen* is a data extraction and analysis tool built for
|
||||||
|
*Apache Hadoop*. *Rumen* mines *JobHistory* logs to
|
||||||
|
extract meaningful data and stores it in an easily-parsed, condensed
|
||||||
|
format or *digest*. The raw trace data from MapReduce logs are
|
||||||
|
often insufficient for simulation, emulation, and benchmarking, as these
|
||||||
|
tools often attempt to measure conditions that did not occur in the
|
||||||
|
source data. For example, if a task ran locally in the raw trace data
|
||||||
|
but a simulation of the scheduler elects to run that task on a remote
|
||||||
|
rack, the simulator requires a runtime its input cannot provide.
|
||||||
|
To fill in these gaps, Rumen performs a statistical analysis of the
|
||||||
|
digest to estimate the variables the trace doesn't supply. Rumen traces
|
||||||
|
drive both Gridmix (a benchmark of Hadoop MapReduce clusters) and Mumak
|
||||||
|
(a simulator for the JobTracker).
|
||||||
|
|
||||||
|
|
||||||
|
$H3 Motivation
|
||||||
|
|
||||||
|
* Extracting meaningful data from *JobHistory* logs is a common
|
||||||
|
task for any tool built to work on *MapReduce*. It
|
||||||
|
is tedious to write a custom tool which is so tightly coupled with
|
||||||
|
the *MapReduce* framework. Hence there is a need for a
|
||||||
|
built-in tool for performing framework level task of log parsing and
|
||||||
|
analysis. Such a tool would insulate external systems depending on
|
||||||
|
job history against the changes made to the job history format.
|
||||||
|
|
||||||
|
* Performing statistical analysis of various attributes of a
|
||||||
|
*MapReduce Job* such as *task runtimes, task failures
|
||||||
|
etc* is another common task that the benchmarking
|
||||||
|
and simulation tools might need. *Rumen* generates
|
||||||
|
<a href="http://en.wikipedia.org/wiki/Cumulative_distribution_function">
|
||||||
|
*Cumulative Distribution Functions (CDF)*
|
||||||
|
</a> for the Map/Reduce task runtimes.
|
||||||
|
Runtime CDF can be used for extrapolating the task runtime of
|
||||||
|
incomplete, missing and synthetic tasks. Similarly CDF is also
|
||||||
|
computed for the total number of successful tasks for every attempt.
|
||||||
|
|
||||||
|
|
||||||
|
$H3 Components
|
||||||
|
|
||||||
|
*Rumen* consists of 2 components
|
||||||
|
|
||||||
|
* *Trace Builder* :
|
||||||
|
Converts *JobHistory* logs into an easily-parsed format.
|
||||||
|
Currently `TraceBuilder` outputs the trace in
|
||||||
|
<a href="http://www.json.org/">*JSON*</a>
|
||||||
|
format.
|
||||||
|
|
||||||
|
* *Folder* :
|
||||||
|
A utility to scale the input trace. A trace obtained from
|
||||||
|
*TraceBuilder* simply summarizes the jobs in the
|
||||||
|
input folders and files. The time-span within which all the jobs in
|
||||||
|
a given trace finish can be considered as the trace runtime.
|
||||||
|
*Folder* can be used to scale the runtime of a trace.
|
||||||
|
Decreasing the trace runtime might involve dropping some jobs from
|
||||||
|
the input trace and scaling down the runtime of remaining jobs.
|
||||||
|
Increasing the trace runtime might involve adding some dummy jobs to
|
||||||
|
the resulting trace and scaling up the runtime of individual jobs.
|
||||||
|
|
||||||
|
|
||||||
|
How to use Rumen?
|
||||||
|
-----------------
|
||||||
|
|
||||||
|
Converting *JobHistory* logs into a desired job-trace consists of 2 steps
|
||||||
|
|
||||||
|
1. Extracting information into an intermediate format
|
||||||
|
|
||||||
|
2. Adjusting the job-trace obtained from the intermediate trace to
|
||||||
|
have the desired properties.
|
||||||
|
|
||||||
|
> Extracting information from *JobHistory* logs is a one time
|
||||||
|
> operation. This so-called *Gold Trace* can be reused to
|
||||||
|
> generate traces with desired values of properties such as
|
||||||
|
> `output-duration`, `concentration` etc.
|
||||||
|
|
||||||
|
*Rumen* provides 2 basic commands
|
||||||
|
|
||||||
|
* `TraceBuilder`
|
||||||
|
* `Folder`
|
||||||
|
|
||||||
|
Firstly, we need to generate the *Gold Trace*. Hence the first
|
||||||
|
step is to run `TraceBuilder` on a job-history folder.
|
||||||
|
The output of the `TraceBuilder` is a job-trace file (and an
|
||||||
|
optional cluster-topology file). In case we want to scale the output, we
|
||||||
|
can use the `Folder` utility to fold the current trace to the
|
||||||
|
desired length. The remaining part of this section explains these
|
||||||
|
utilities in detail.
|
||||||
|
|
||||||
|
> Examples in this section assume that certain libraries are present
> in the java CLASSPATH. See the *Dependencies* section in the Appendix for more details.
|
||||||
|
|
||||||
|
|
||||||
|
$H3 Trace Builder
|
||||||
|
|
||||||
|
`Command:`
|
||||||
|
|
||||||
|
java org.apache.hadoop.tools.rumen.TraceBuilder [options] <jobtrace-output> <topology-output> <inputs>
|
||||||
|
|
||||||
|
This command invokes the `TraceBuilder` utility of
|
||||||
|
*Rumen*. It converts the JobHistory files into a series of JSON
|
||||||
|
objects and writes them into the `<jobtrace-output>`
|
||||||
|
file. It also extracts the cluster layout (topology) and writes it in
|
||||||
|
the`<topology-output>` file.
|
||||||
|
`<inputs>` represents a space-separated list of
|
||||||
|
JobHistory files and folders.
|
||||||
|
|
||||||
|
> 1) Input and output to `TraceBuilder` is expected to
|
||||||
|
> be a fully qualified FileSystem path. So use file://
|
||||||
|
> to specify files on the `local` FileSystem and
|
||||||
|
> hdfs:// to specify files on HDFS. Since input files and
> folders are FileSystem paths, they can be globbed.
|
||||||
|
> This can be useful while specifying multiple file paths using
|
||||||
|
> regular expressions.
|
||||||
|
|
||||||
|
> 2) By default, TraceBuilder does not recursively scan the input
|
||||||
|
> folder for job history files. Only the files that are directly
|
||||||
|
> placed under the input folder will be considered for generating
|
||||||
|
> the trace. To add all the files under the input directory by
|
||||||
|
> recursively scanning the input directory, use the `-recursive`
> option.
|
||||||
|
|
||||||
|
Cluster topology is used as follows :
|
||||||
|
|
||||||
|
* To reconstruct the splits and make sure that the
|
||||||
|
distances/latencies seen in the actual run are modeled correctly.
|
||||||
|
|
||||||
|
* To extrapolate splits information for tasks with missing splits
|
||||||
|
details or synthetically generated tasks.
|
||||||
|
|
||||||
|
`Options :`
|
||||||
|
|
||||||
|
<table>
|
||||||
|
<tr>
|
||||||
|
<th> Parameter</th>
|
||||||
|
<th> Description</th>
|
||||||
|
<th> Notes </th>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td><code>-demuxer</code></td>
|
||||||
|
<td>Used to read the jobhistory files. The default is
|
||||||
|
<code>DefaultInputDemuxer</code>.</td>
|
||||||
|
<td>Demuxer decides how the input file maps to jobhistory file(s).
|
||||||
|
Job history logs and job configuration files are typically small
|
||||||
|
files, and can be more effectively stored when embedded in some
|
||||||
|
container file format like SequenceFile or TFile. To support such
|
||||||
|
usage cases, one can specify a customized Demuxer class that can
|
||||||
|
extract individual job history logs and job configuration files
|
||||||
|
from the source files.
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td><code>-recursive</code></td>
|
||||||
|
<td>Recursively traverse input paths for job history logs.</td>
|
||||||
|
<td>This option should be used to inform the TraceBuilder to
|
||||||
|
recursively scan the input paths and process all the files under it.
|
||||||
|
Note that, by default, only the history logs that are directly under
|
||||||
|
the input folder are considered for generating the trace.
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
|
||||||
|
|
||||||
|
$H4 Example
|
||||||
|
|
||||||
|
java org.apache.hadoop.tools.rumen.TraceBuilder file:///home/user/job-trace.json file:///home/user/topology.output file:///home/user/logs/history/done
|
||||||
|
|
||||||
|
This will analyze all the jobs in
|
||||||
|
|
||||||
|
`/home/user/logs/history/done` stored on the
|
||||||
|
`local` FileSystem and output the jobtraces in
|
||||||
|
`/home/user/job-trace.json` along with topology
|
||||||
|
information in `/home/user/topology.output`.
|
||||||
|
|
||||||
|
|
||||||
|
$H3 Folder
|
||||||
|
|
||||||
|
`Command`:
|
||||||
|
|
||||||
|
java org.apache.hadoop.tools.rumen.Folder [options] [input] [output]
|
||||||
|
|
||||||
|
> Input and output to `Folder` is expected to be a fully
|
||||||
|
> qualified FileSystem path. So use file:// to specify
|
||||||
|
> files on the `local` FileSystem and hdfs:// to
|
||||||
|
> specify files on HDFS.
|
||||||
|
|
||||||
|
This command invokes the `Folder` utility of
|
||||||
|
*Rumen*. Folding essentially means that the output duration of
|
||||||
|
the resulting trace is fixed and job timelines are adjusted
|
||||||
|
to respect the final output duration.
|
||||||
|
|
||||||
|
`Options :`
|
||||||
|
|
||||||
|
<table>
|
||||||
|
<tr>
|
||||||
|
<th> Parameter</th>
|
||||||
|
<th> Description</th>
|
||||||
|
<th> Notes </th>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td><code>-input-cycle</code></td>
|
||||||
|
<td>Defines the basic unit of time for the folding operation. There is
|
||||||
|
no default value for <code>input-cycle</code>.
|
||||||
|
<strong>Input cycle must be provided</strong>.
|
||||||
|
</td>
|
||||||
|
<td>'<code>-input-cycle 10m</code>'
|
||||||
|
implies that the whole trace run will now be sliced at a 10-minute
|
||||||
|
interval. Basic operations will be done on the 10m chunks. Note
|
||||||
|
that *Rumen* understands various time units like
|
||||||
|
<em>m(min), h(hour), d(days) etc</em>.
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td><code>-output-duration</code></td>
|
||||||
|
<td>This parameter defines the final runtime of the trace.
|
||||||
|
The default value is <strong>1 hour</strong>.
|
||||||
|
</td>
|
||||||
|
<td>'<code>-output-duration 30m</code>'
|
||||||
|
implies that the resulting trace will have a max runtime of
|
||||||
|
30mins. All the jobs in the input trace file will be folded and
|
||||||
|
scaled to fit this window.
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td><code>-concentration</code></td>
|
||||||
|
<td>Set the concentration of the resulting trace. Default value is
|
||||||
|
<strong>1</strong>.
|
||||||
|
</td>
|
||||||
|
<td>If the total runtime of the resulting trace is less than the total
|
||||||
|
runtime of the input trace, then the resulting trace would contain
a smaller number of jobs than the input trace. This
|
||||||
|
essentially means that the output is diluted. To increase the
|
||||||
|
density of jobs, set the concentration to a higher value.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td><code>-debug</code></td>
|
||||||
|
<td>Run the Folder in debug mode. By default it is set to
|
||||||
|
<strong>false</strong>.</td>
|
||||||
|
<td>In debug mode, the Folder will print additional statements for
|
||||||
|
debugging. Also the intermediate files generated in the scratch
|
||||||
|
directory will not be cleaned up.
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td><code>-seed</code></td>
|
||||||
|
<td>Initial seed to the Random Number Generator. By default, a Random
|
||||||
|
Number Generator is used to generate a seed and the seed value is
|
||||||
|
reported back to the user for future use.
|
||||||
|
</td>
|
||||||
|
<td>If an initial seed is passed, then the <code>Random Number
|
||||||
|
Generator</code> will generate the random numbers in the same
|
||||||
|
sequence i.e the sequence of random numbers remains same if the
|
||||||
|
same seed is used. Folder uses Random Number Generator to decide
|
||||||
|
whether or not to emit the job.
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td><code>-temp-directory</code></td>
|
||||||
|
<td>Temporary directory for the Folder. By default the <strong>output
|
||||||
|
folder's parent directory</strong> is used as the scratch space.
|
||||||
|
</td>
|
||||||
|
<td>This is the scratch space used by Folder. All the
|
||||||
|
temporary files are cleaned up in the end unless the Folder is run
|
||||||
|
in <code>debug</code> mode.</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td><code>-skew-buffer-length</code></td>
|
||||||
|
<td>Enables <em>Folder</em> to tolerate skewed jobs.
|
||||||
|
The default buffer length is <strong>0</strong>.</td>
|
||||||
|
<td>'<code>-skew-buffer-length 100</code>'
|
||||||
|
indicates that if the jobs appear out of order within a window
|
||||||
|
size of 100, then they will be emitted in-order by the folder.
|
||||||
|
If a job appears out-of-order outside this window, then the Folder
|
||||||
|
will bail out provided <code>-allow-missorting</code> is not set.
|
||||||
|
<em>Folder</em> reports the maximum skew size seen in the
|
||||||
|
input trace for future use.
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td><code>-allow-missorting</code></td>
|
||||||
|
<td>Enables <em>Folder</em> to tolerate out-of-order jobs. By default
|
||||||
|
mis-sorting is not allowed.
|
||||||
|
</td>
|
||||||
|
<td>If mis-sorting is allowed, then the <em>Folder</em> will ignore
|
||||||
|
out-of-order jobs that cannot be deskewed using a skew buffer of
|
||||||
|
size specified using <code>-skew-buffer-length</code>. If
|
||||||
|
mis-sorting is not allowed, then the Folder will bail out if the
|
||||||
|
skew buffer is incapable of tolerating the skew.
|
||||||
|
</td>
|
||||||
|
</tr>
|
||||||
|
</table>
|
||||||
|
|
||||||
|
|
||||||
|
$H4 Examples
|
||||||
|
|
||||||
|
$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime
|
||||||
|
|
||||||
|
java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
|
||||||
|
|
||||||
|
If the folded jobs are out of order then the command will bail out.
|
||||||
|
|
||||||
|
$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime and tolerate some skewness
|
||||||
|
|
||||||
|
java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m -allow-missorting -skew-buffer-length 100 file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
|
||||||
|
|
||||||
|
If the folded jobs are out of order, then at most
|
||||||
|
100 jobs will be de-skewed. If the 101<sup>st</sup> job is
|
||||||
|
*out-of-order*, then the command will bail out.
|
||||||
|
|
||||||
|
$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime in debug mode
|
||||||
|
|
||||||
|
java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m -debug -temp-directory file:///tmp/debug file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
|
||||||
|
|
||||||
|
This will fold the 10hr job-trace file
|
||||||
|
`file:///home/user/job-trace.json` to finish within 1hr
|
||||||
|
and use `file:///tmp/debug` as the temporary directory.
|
||||||
|
The intermediate files in the temporary directory will not be cleaned
|
||||||
|
up.
|
||||||
|
|
||||||
|
$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime with custom concentration.
|
||||||
|
|
||||||
|
java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m -concentration 2 file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
|
||||||
|
|
||||||
|
This will fold the 10hr job-trace file
|
||||||
|
`file:///home/user/job-trace.json` to finish within 1hr
|
||||||
|
with a concentration of 2. Folding a 10-hour trace into a 1-hour trace
(as in the earlier examples) retains only 10% of the jobs. With *concentration* set to 2, 20% of the total input
jobs will be retained.
|
||||||
|
|
||||||
|
|
||||||
|
Appendix
|
||||||
|
--------
|
||||||
|
|
||||||
|
$H3 Resources
|
||||||
|
|
||||||
|
<a href="https://issues.apache.org/jira/browse/MAPREDUCE-751">MAPREDUCE-751</a>
|
||||||
|
is the main JIRA that introduced *Rumen* to *MapReduce*.
|
||||||
|
Look at the MapReduce
|
||||||
|
<a href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313617">
|
||||||
|
rumen-component</a> for further details.
|
||||||
|
|
||||||
|
|
||||||
|
$H3 Dependencies
|
||||||
|
|
||||||
|
*Rumen* expects certain library *JARs* to be present in
|
||||||
|
the *CLASSPATH*. The required libraries are
|
||||||
|
|
||||||
|
* `Hadoop MapReduce Tools` (`hadoop-mapred-tools-{hadoop-version}.jar`)
|
||||||
|
* `Hadoop Common` (`hadoop-common-{hadoop-version}.jar`)
|
||||||
|
* `Apache Commons Logging` (`commons-logging-1.1.1.jar`)
|
||||||
|
* `Apache Commons CLI` (`commons-cli-1.2.jar`)
|
||||||
|
* `Jackson Mapper` (`jackson-mapper-asl-1.4.2.jar`)
|
||||||
|
* `Jackson Core` (`jackson-core-asl-1.4.2.jar`)
|
||||||
|
|
||||||
|
> One simple way to run Rumen is to use '$HADOOP_HOME/bin/hadoop jar'
|
||||||
|
> option to run it.
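A hedged sketch of such an invocation, reusing the paths from the earlier
TraceBuilder example (the jar name is a placeholder for whichever jar provides
the Rumen classes in your installation):

    hadoop jar <rumen-or-tools-jar> org.apache.hadoop.tools.rumen.TraceBuilder \
      file:///home/user/job-trace.json file:///home/user/topology.output \
      file:///home/user/logs/history/done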
|