MAPREDUCE-6150. Update document of Rumen (Masatake Iwasaki via aw)

Allen Wittenauer 2015-01-29 14:17:44 -08:00
parent ad55083f75
commit 3f982c5c26
3 changed files with 90 additions and 46 deletions


@@ -264,6 +264,8 @@ Release 2.7.0 - UNRELEASED
 MAPREDUCE-6141. History server leveldb recovery store (jlowe)
+MAPREDUCE-6150. Update document of Rumen (Masatake Iwasaki via aw)
 OPTIMIZATIONS
 MAPREDUCE-6169. MergeQueue should release reference to the current item


@@ -105,6 +105,7 @@
 <item name="Hadoop Streaming" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopStreaming.html"/>
 <item name="Hadoop Archives" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html"/>
 <item name="DistCp" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistCp.html"/>
+<item name="Rumen" href="hadoop-rumen/Rumen.html"/>
 </menu>
 <menu name="MapReduce REST APIs" inherit="top">


@@ -29,9 +29,7 @@ Rumen
 - [Components](#Components)
 - [How to use Rumen?](#How_to_use_Rumen)
 - [Trace Builder](#Trace_Builder)
-- [Example](#Example)
 - [Folder](#Folder)
-- [Examples](#Examples)
 - [Appendix](#Appendix)
 - [Resources](#Resources)
 - [Dependencies](#Dependencies)
@@ -128,18 +126,21 @@ can use the `Folder` utility to fold the current trace to the
 desired length. The remaining part of this section explains these
 utilities in detail.
-> Examples in this section assumes that certain libraries are present
-> in the java CLASSPATH. See <em>Section-3.2</em> for more details.
+Examples in this section assumes that certain libraries are present
+in the java CLASSPATH. See [Dependencies](#Dependencies) for more details.
 $H3 Trace Builder
-`Command:`
-java org.apache.hadoop.tools.rumen.TraceBuilder [options] <jobtrace-output> <topology-output> <inputs>
-This command invokes the `TraceBuilder` utility of
-*Rumen*. It converts the JobHistory files into a series of JSON
+$H4 Command
+```
+java org.apache.hadoop.tools.rumen.TraceBuilder [options] <jobtrace-output> <topology-output> <inputs>
+```
+This command invokes the `TraceBuilder` utility of *Rumen*.
+TraceBuilder converts the JobHistory files into a series of JSON
 objects and writes them into the `<jobtrace-output>`
 file. It also extracts the cluster layout (topology) and writes it in
 the`<topology-output>` file.
@@ -169,7 +170,7 @@ Cluster topology is used as follows :
 * To extrapolate splits information for tasks with missing splits
 details or synthetically generated tasks.
-`Options :`
+$H4 Options
 <table>
 <tr>
@@ -204,33 +205,45 @@ Cluster topology is used as follows :
 $H4 Example
-java org.apache.hadoop.tools.rumen.TraceBuilder file:///home/user/job-trace.json file:///home/user/topology.output file:///home/user/logs/history/done
+*Rumen* expects certain library *JARs* to be present in the *CLASSPATH*.
+One simple way to run Rumen is to use
+`$HADOOP_HOME/bin/hadoop jar` command to run it as example below.
+```
+java org.apache.hadoop.tools.rumen.TraceBuilder \
+file:///tmp/job-trace.json \
+file:///tmp/job-topology.json \
+hdfs:///tmp/hadoop-yarn/staging/history/done_intermediate/testuser
+```
 This will analyze all the jobs in
-`/home/user/logs/history/done` stored on the
-`local` FileSystem and output the jobtraces in
-`/home/user/job-trace.json` along with topology
-information in `/home/user/topology.output`.
+`/tmp/hadoop-yarn/staging/history/done_intermediate/testuser`
+stored on the `HDFS` FileSystem
+and output the jobtraces in `/tmp/job-trace.json`
+along with topology information in `/tmp/job-topology.json`
+stored on the `local` FileSystem.
 $H3 Folder
-`Command`:
-java org.apache.hadoop.tools.rumen.Folder [options] [input] [output]
-> Input and output to `Folder` is expected to be a fully
-> qualified FileSystem path. So use file:// to specify
-> files on the `local` FileSystem and hdfs:// to
-> specify files on HDFS.
+$H4 Command
+```
+java org.apache.hadoop.tools.rumen.Folder [options] [input] [output]
+```
 This command invokes the `Folder` utility of
 *Rumen*. Folding essentially means that the output duration of
 the resulting trace is fixed and job timelines are adjusted
 to respect the final output duration.
-`Options :`
+> Input and output to `Folder` is expected to be a fully
+> qualified FileSystem path. So use `file://` to specify
+> files on the `local` FileSystem and `hdfs://` to
+> specify files on HDFS.
+$H4 Options
 <table>
 <tr>
@@ -335,14 +348,28 @@ to respect the final output duration.
 $H4 Examples
 $H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime
-java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+```
+java org.apache.hadoop.tools.rumen.Folder \
+-output-duration 1h \
+-input-cycle 20m \
+file:///tmp/job-trace.json \
+file:///tmp/job-trace-1hr.json
+```
 If the folded jobs are out of order then the command will bail out.
 $H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime and tolerate some skewness
-java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m -allow-missorting -skew-buffer-length 100 file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+```
+java org.apache.hadoop.tools.rumen.Folder \
+-output-duration 1h \
+-input-cycle 20m \
+-allow-missorting \
+-skew-buffer-length 100 \
+file:///tmp/job-trace.json \
+file:///tmp/job-trace-1hr.json
+```
 If the folded jobs are out of order, then atmost
 100 jobs will be de-skewed. If the 101<sup>st</sup> job is
@@ -350,23 +377,37 @@ If the folded jobs are out of order, then atmost
 $H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime in debug mode
-java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m -debug -temp-directory file:///tmp/debug file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+```
+java org.apache.hadoop.tools.rumen.Folder \
+-output-duration 1h \
+-input-cycle 20m \
+-debug -temp-directory file:///tmp/debug \
+file:///tmp/job-trace.json \
+file:///tmp/job-trace-1hr.json
+```
 This will fold the 10hr job-trace file
-`file:///home/user/job-trace.json` to finish within 1hr
+`file:///tmp/job-trace.json` to finish within 1hr
 and use `file:///tmp/debug` as the temporary directory.
 The intermediate files in the temporary directory will not be cleaned
 up.
 $H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime with custom concentration.
-java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m -concentration 2 file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json</source>
+```
+java org.apache.hadoop.tools.rumen.Folder \
+-output-duration 1h \
+-input-cycle 20m \
+-concentration 2 \
+file:///tmp/job-trace.json \
+file:///tmp/job-trace-1hr.json
+```
 This will fold the 10hr job-trace file
-`file:///home/user/job-trace.json` to finish within 1hr
-with concentration of 2. `Example-2.3.2` will retain 10%
-of the jobs. With *concentration* as 2, 20% of the total input
-jobs will be retained.
+`file:///tmp/job-trace.json` to finish within 1hr
+with concentration of 2.
+If the 10h job-trace is folded to 1h, it retains 10% of the jobs by default.
+With *concentration* as 2, 20% of the total input jobs will be retained.
 Appendix
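
A rough illustration of the retention arithmetic behind the concentration example above (the job count is hypothetical, chosen only to make the percentages concrete):

```
# folding a 10h trace into a 1h trace keeps 1h / 10h = 10% of the jobs by default
# with -concentration 2, retention doubles to 2 x 10% = 20%
# e.g. a hypothetical 10h trace with 600 jobs -> ~60 jobs by default, ~120 with -concentration 2
```
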
@@ -377,21 +418,21 @@ $H3 Resources
 <a href="https://issues.apache.org/jira/browse/MAPREDUCE-751">MAPREDUCE-751</a>
 is the main JIRA that introduced *Rumen* to *MapReduce*.
 Look at the MapReduce
-<a href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313617">
-rumen-component</a>for further details.
+<a href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313617">rumen-component</a>
+for further details.
 $H3 Dependencies
-*Rumen* expects certain library *JARs* to be present in
-the *CLASSPATH*. The required libraries are
-* `Hadoop MapReduce Tools` (`hadoop-mapred-tools-{hadoop-version}.jar`)
-* `Hadoop Common` (`hadoop-common-{hadoop-version}.jar`)
-* `Apache Commons Logging` (`commons-logging-1.1.1.jar`)
-* `Apache Commons CLI` (`commons-cli-1.2.jar`)
-* `Jackson Mapper` (`jackson-mapper-asl-1.4.2.jar`)
-* `Jackson Core` (`jackson-core-asl-1.4.2.jar`)
-> One simple way to run Rumen is to use '$HADOOP_HOME/bin/hadoop jar'
-> option to run it.
+*Rumen* expects certain library *JARs* to be present in the *CLASSPATH*.
+One simple way to run Rumen is to use
+`hadoop jar` command to run it as example below.
+```
+$HADOOP_HOME/bin/hadoop jar \
+$HADOOP_HOME/share/hadoop/tools/lib/hadoop-rumen-2.5.1.jar \
+org.apache.hadoop.tools.rumen.TraceBuilder \
+file:///tmp/job-trace.json \
+file:///tmp/job-topology.json \
+hdfs:///tmp/hadoop-yarn/staging/history/done_intermediate/testuser
+```
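
The same `$HADOOP_HOME/bin/hadoop jar` invocation style should also work for the `Folder` utility. The following is only a sketch, assuming the JAR location shown above and the trace produced by the earlier TraceBuilder example:

```
$HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/share/hadoop/tools/lib/hadoop-rumen-2.5.1.jar \
org.apache.hadoop.tools.rumen.Folder \
-output-duration 1h \
-input-cycle 20m \
file:///tmp/job-trace.json \
file:///tmp/job-trace-1hr.json
```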