MAPREDUCE-6151. Update document of GridMix (Masatake Iwasaki via aw)
commit 12e883007c (parent f2c91098c4)
@@ -266,6 +266,8 @@ Release 2.7.0 - UNRELEASED

     MAPREDUCE-6150. Update document of Rumen (Masatake Iwasaki via aw)

+    MAPREDUCE-6151. Update document of GridMix (Masatake Iwasaki via aw)
+
     OPTIMIZATIONS

     MAPREDUCE-6169. MergeQueue should release reference to the current item
@@ -105,6 +105,7 @@
       <item name="Hadoop Streaming" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopStreaming.html"/>
       <item name="Hadoop Archives" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html"/>
       <item name="DistCp" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistCp.html"/>
+      <item name="GridMix" href="hadoop-gridmix/GridMix.html"/>
       <item name="Rumen" href="hadoop-rumen/Rumen.html"/>
     </menu>
@@ -38,21 +38,14 @@ Overview

 GridMix is a benchmark for Hadoop clusters. It submits a mix of
 synthetic jobs, modeling a profile mined from production loads.

-There exist three versions of the GridMix tool. This document
-discusses the third (checked into `src/contrib`), distinct
-from the two checked into the `src/benchmarks` sub-directory.
-While the first two versions of the tool included stripped-down versions
-of common jobs, both were principally saturation tools for stressing the
-framework at scale. In support of a broader range of deployments and
-finer-tuned job mixes, this version of the tool will attempt to model
+This version of the tool will attempt to model
 the resource profiles of production jobs to identify bottlenecks, guide
-development, and serve as a replacement for the existing GridMix
-benchmarks.
+development.

 To run GridMix, you need a MapReduce job trace describing the job mix
-for a given cluster. Such traces are typically generated by Rumen (see
-Rumen documentation). GridMix also requires input data from which the
+for a given cluster. Such traces are typically generated by
+[Rumen](../hadoop-rumen/Rumen.html).
+GridMix also requires input data from which the
 synthetic jobs will be reading bytes. The input data need not be in any
 particular format, as the synthetic jobs are currently binary readers.
 If you are running on a new cluster, an optional step generating input
@@ -62,10 +55,15 @@ on the same or another cluster, follow these steps:

 1.  Locate the job history files on the production cluster. This
     location is specified by the
-    `mapred.job.tracker.history.completed.location`
+    `mapreduce.jobhistory.done-dir` or
+    `mapreduce.jobhistory.intermediate-done-dir`
     configuration property of the cluster.
+    ([MapReduce historyserver](../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredCommands.html#historyserver)
+    moves job history files from `mapreduce.jobhistory.intermediate-done-dir`
+    to `mapreduce.jobhistory.done-dir`.)

-2.  Run Rumen to build a job trace in JSON format for all or select jobs.
+2.  Run [Rumen](../hadoop-rumen/Rumen.html)
+    to build a job trace in JSON format for all or select jobs.

 3.  Use GridMix with the job trace on the benchmark cluster.
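Steps 2 and 3 above can be sketched end-to-end as shell commands. This is a hedged sketch, not part of the committed document: the Rumen `org.apache.hadoop.tools.rumen.TraceBuilder` entry point is taken from the Rumen documentation, while the file paths, trace names, and wildcard JAR version are illustrative assumptions to adjust for your installation.

```shell
# Step 2 (sketch): build a JSON job trace from the copied job history files.
# TraceBuilder takes: <trace-output> <topology-output> <history-input...>
# (all paths here are hypothetical)
hadoop org.apache.hadoop.tools.rumen.TraceBuilder \
  file:///tmp/job-trace.json file:///tmp/topology.json \
  hdfs:///history/done

# Step 3 (sketch): replay the trace on the benchmark cluster,
# generating 100 GiB of synthetic input data under <iopath>/input.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-gridmix-*.jar \
  -generate 100g /gridmix/io file:///tmp/job-trace.json
```

These commands require a running Hadoop cluster, so they are shown as a cluster-dependent sketch rather than a standalone script.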
@@ -79,13 +77,17 @@ Usage

 Basic command-line usage without configuration parameters:

-    org.apache.hadoop.mapred.gridmix.Gridmix [-generate <size>] [-users <users-list>] <iopath> <trace>
+```
+java org.apache.hadoop.mapred.gridmix.Gridmix [-generate <size>] [-users <users-list>] <iopath> <trace>
+```

 Basic command-line usage with configuration parameters:

-    org.apache.hadoop.mapred.gridmix.Gridmix \
-      -Dgridmix.client.submit.threads=10 -Dgridmix.output.directory=foo \
-      [-generate <size>] [-users <users-list>] <iopath> <trace>
+```
+java org.apache.hadoop.mapred.gridmix.Gridmix \
+  -Dgridmix.client.submit.threads=10 -Dgridmix.output.directory=foo \
+  [-generate <size>] [-users <users-list>] <iopath> <trace>
+```

 > Configuration parameters like
 > `-Dgridmix.client.submit.threads=10` and
@@ -102,6 +104,8 @@ The `-generate` option is used to generate input data and
 Distributed Cache files for the synthetic jobs. It accepts standard units
 of size suffixes, e.g. `100g` will generate
 100 * 2<sup>30</sup> bytes as input data.
+The minimum size of input data in compressed format (128MB by default)
+is defined by `gridmix.min.file.size`.
 `<iopath>/input` is the destination directory for
 generated input data and/or the directory from which input data will be
 read. HDFS-based Distributed Cache files are generated under the
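As a quick, standalone sanity check of the size-suffix arithmetic described in the hunk above (`100g` means 100 * 2<sup>30</sup> bytes), independent of GridMix itself:

```shell
# 100g = 100 * 2^30 bytes; shell arithmetic confirms the byte count.
bytes=$((100 * 2**30))
echo "$bytes"   # 107374182400
```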
@@ -121,16 +125,17 @@ uncompressed. Use "-" as the value of this parameter if you
 want to pass an *uncompressed* trace via the standard
 input-stream of GridMix.

-The class `org.apache.hadoop.mapred.gridmix.Gridmix` can
-be found in the JAR
-`contrib/gridmix/hadoop-gridmix-$VERSION.jar` inside your
-Hadoop installation, where `$VERSION` corresponds to the
-version of Hadoop installed. A simple way of ensuring that this class
-and all its dependencies are loaded correctly is to use the
-`hadoop` wrapper script in Hadoop:
+GridMix expects certain library *JARs* to be present in the *CLASSPATH*.
+One simple way to run GridMix is to use the `hadoop jar` command.
+You also need to add the Rumen JAR to the classpath of both the client
+and the tasks, as shown in the example below.

-    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
-      [-generate <size>] [-users <users-list>] <iopath> <trace>
+```
+HADOOP_CLASSPATH=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-rumen-2.5.1.jar \
+$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-gridmix-2.5.1.jar \
+  -libjars $HADOOP_HOME/share/hadoop/tools/lib/hadoop-rumen-2.5.1.jar \
+  [-generate <size>] [-users <users-list>] <iopath> <trace>
+```

 The supported configuration parameters are explained in the
 following sections.
@@ -262,14 +267,14 @@ recorded in the trace. It constructs jobs of two types:
 </td>
 <td>A synthetic job where each task does *nothing* but sleep
 for a certain duration as observed in the production trace. The
-scalability of the Job Tracker is often limited by how many
+scalability of the ResourceManager is often limited by how many
 heartbeats it can handle every second. (Heartbeats are periodic
-messages sent from Task Trackers to update their status and grab new
-tasks from the Job Tracker.) Since a benchmark cluster is typically
+messages sent from NodeManagers to update their status and grab new
+tasks from the ResourceManager.) Since a benchmark cluster is typically
 a fraction in size of a production cluster, the heartbeat traffic
 generated by the slave nodes is well below the level of the
-production cluster. One possible solution is to run multiple Task
-Trackers on each slave node. This leads to the obvious problem that
+production cluster. One possible solution is to run multiple
+NodeManagers on each slave node. This leads to the obvious problem that
 the I/O workload generated by the synthetic jobs would thrash the
 slave nodes. Hence the need for such a job.</td>
 </tr>
@@ -334,7 +339,7 @@ Job Submission Policies

 GridMix controls the rate of job submission. This control can be
 based on the trace information or can be based on statistics it gathers
-from the Job Tracker. Based on the submission policies users define,
+from the ResourceManager. Based on the submission policies users define,
 GridMix uses the respective algorithm to control the job submission.
 There are currently three types of policies:
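The submission policy is selected through a configuration property. A hedged sketch follows: the property name `gridmix.job-submission.policy` and the value `REPLAY` are taken from the GridMix documentation's configuration tables, while the wildcard JAR path is illustrative; the bracketed placeholders match the usage synopsis above.

```shell
# Sketch: run GridMix with an explicit job-submission policy
# (REPLAY here; the JAR path is an assumption about your layout).
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-gridmix-*.jar \
  -Dgridmix.job-submission.policy=REPLAY \
  [-generate <size>] [-users <users-list>] <iopath> <trace>
```

Like the document's own examples, this is a synopsis against a live cluster rather than a runnable standalone script.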
@@ -407,9 +412,9 @@ The following configuration parameters affect the job submission policy:
 <td>
 <code>gridmix.throttle.jobs-to-tracker-ratio</code>
 </td>
-<td>In STRESS mode, the minimum ratio of running jobs to Task
-Trackers in a cluster for the cluster to be considered
-*overloaded* . This is the threshold TJ referred to earlier.
+<td>In STRESS mode, the minimum ratio of running jobs to
+NodeManagers in a cluster for the cluster to be considered
+*overloaded*. This is the threshold TJ referred to earlier.
 The default is 1.0.</td>
 </tr>
 <tr>
@@ -688,20 +693,16 @@ correctly emulate compression.
 Emulating High-Ram jobs
 -----------------------

-MapReduce allows users to define a job as a High-Ram job. Tasks from a
-High-Ram job can occupy multiple slots on the task-trackers.
-Task-tracker assigns fixed virtual memory for each slot. Tasks from
-High-Ram jobs can occupy multiple slots and thus can use up more
-virtual memory as compared to a default task.
-
-Emulating this behavior is important because of the following reasons
+MapReduce allows users to define a job as a High-Ram job. Tasks from a
+High-Ram job can occupy a larger fraction of memory in task processes.
+Emulating this behavior is important for the following reasons.

 * Impact on scheduler: Scheduling of tasks from High-Ram jobs
-  impacts the scheduling behavior as it might result into slot
-  reservation and slot/resource utilization.
+  impacts the scheduling behavior as it might result in
+  resource reservation and utilization.

-* Impact on the node : Since High-Ram tasks occupy multiple slots,
-  trackers do some bookkeeping for allocating extra resources for
+* Impact on the node: Since High-Ram tasks occupy more memory,
+  NodeManagers do some bookkeeping for allocating extra resources for
   these tasks. Thus this becomes a precursor for memory emulation
   where tasks with high memory requirements need to be considered
   as High-Ram tasks.
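For context on the hunk above, a job is typically marked High-Ram by requesting more memory per task than the cluster default. A hedged sketch using the standard MapReduce memory properties (`mapreduce.map.memory.mb` and `mapreduce.reduce.memory.mb` are real Hadoop properties; the JAR name, megabyte values, and input/output placeholders are illustrative assumptions):

```shell
# Sketch: submit a job whose tasks request larger containers than the
# default, which is what makes it a High-Ram job (values illustrative).
hadoop jar my-job.jar \
  -Dmapreduce.map.memory.mb=3072 \
  -Dmapreduce.reduce.memory.mb=6144 \
  <input> <output>
```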
@@ -808,11 +809,11 @@ job traces and cannot be accurately reproduced in GridMix:
 Appendix
 --------

+There exist older versions of the GridMix tool.
 Issues tracking the original implementations of
-<a href="https://issues.apache.org/jira/browse/HADOOP-2369">GridMix1</a>,
-<a href="https://issues.apache.org/jira/browse/HADOOP-3770">GridMix2</a>,
-and <a href="https://issues.apache.org/jira/browse/MAPREDUCE-776">GridMix3</a>
+[GridMix1](https://issues.apache.org/jira/browse/HADOOP-2369),
+[GridMix2](https://issues.apache.org/jira/browse/HADOOP-3770),
+and [GridMix3](https://issues.apache.org/jira/browse/MAPREDUCE-776)
 can be found on the Apache Hadoop MapReduce JIRA. Other issues tracking
 the current development of GridMix can be found by searching
-<a href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313086">
-the Apache Hadoop MapReduce JIRA</a>
+[the Apache Hadoop MapReduce JIRA](https://issues.apache.org/jira/browse/MAPREDUCE/component/12313086).