From 3f982c5c266f2ccf532fa425cc28c0b5c5a2a7e1 Mon Sep 17 00:00:00 2001
From: Allen Wittenauer
Date: Thu, 29 Jan 2015 14:17:44 -0800
Subject: [PATCH] MAPREDUCE-6150. Update document of Rumen (Masatake Iwasaki via aw)
---
 hadoop-mapreduce-project/CHANGES.txt | 2 +
 hadoop-project/src/site/site.xml | 1 +
 .../src/site/markdown/Rumen.md.vm | 133 ++++++++++++------
 3 files changed, 90 insertions(+), 46 deletions(-)
diff --git a/hadoop-mapreduce-project/CHANGES.txt b/hadoop-mapreduce-project/CHANGES.txt
index 39ff8cc3a74..496913fa2a2 100644
--- a/hadoop-mapreduce-project/CHANGES.txt
+++ b/hadoop-mapreduce-project/CHANGES.txt
@@ -264,6 +264,8 @@ Release 2.7.0 - UNRELEASED
 MAPREDUCE-6141. History server leveldb recovery store (jlowe)
+ MAPREDUCE-6150. Update document of Rumen (Masatake Iwasaki via aw)
+
 OPTIMIZATIONS
 MAPREDUCE-6169. MergeQueue should release reference to the current item
diff --git a/hadoop-project/src/site/site.xml b/hadoop-project/src/site/site.xml
index 6fa66484c04..113cb138cc0 100644
--- a/hadoop-project/src/site/site.xml
+++ b/hadoop-project/src/site/site.xml
@@ -105,6 +105,7 @@
+
diff --git a/hadoop-tools/hadoop-rumen/src/site/markdown/Rumen.md.vm b/hadoop-tools/hadoop-rumen/src/site/markdown/Rumen.md.vm
index e25f3a794ae..bee976a0e90 100644
--- a/hadoop-tools/hadoop-rumen/src/site/markdown/Rumen.md.vm
+++ b/hadoop-tools/hadoop-rumen/src/site/markdown/Rumen.md.vm
@@ -29,9 +29,7 @@ Rumen
 - [Components](#Components)
 - [How to use Rumen?](#How_to_use_Rumen)
 - [Trace Builder](#Trace_Builder)
- - [Example](#Example)
 - [Folder](#Folder)
- - [Examples](#Examples)
 - [Appendix](#Appendix)
 - [Resources](#Resources)
 - [Dependencies](#Dependencies)
@@ -128,18 +126,21 @@ can use the `Folder` utility to fold the current trace to the desired length. The remaining part of this section explains these utilities in detail.
-> Examples in this section assumes that certain libraries are present
-> in the java CLASSPATH. See Section-3.2 for more details.
+Examples in this section assumes that certain libraries are present
+in the java CLASSPATH. See [Dependencies](#Dependencies) for more details.
 $H3 Trace Builder
-`Command:`
+$H4 Command
- java org.apache.hadoop.tools.rumen.TraceBuilder [options]
+```
+java org.apache.hadoop.tools.rumen.TraceBuilder [options]
+```
-This command invokes the `TraceBuilder` utility of
-*Rumen*. It converts the JobHistory files into a series of JSON
+This command invokes the `TraceBuilder` utility of *Rumen*.
+
+TraceBuilder converts the JobHistory files into a series of JSON
 objects and writes them into the `` file. It also extracts the cluster layout (topology) and writes it in the`` file.
@@ -169,7 +170,7 @@ Cluster topology is used as follows :
 * To extrapolate splits information for tasks with missing splits details or synthetically generated tasks.
-`Options :`
+$H4 Options
@@ -204,33 +205,45 @@ Cluster topology is used as follows :
 $H4 Example
- java org.apache.hadoop.tools.rumen.TraceBuilder file:///home/user/job-trace.json file:///home/user/topology.output file:///home/user/logs/history/done
+*Rumen* expects certain library *JARs* to be present in the *CLASSPATH*.
+One simple way to run Rumen is to use
+`$HADOOP_HOME/bin/hadoop jar` command to run it as example below.
+
+```
+java org.apache.hadoop.tools.rumen.TraceBuilder \
+ file:///tmp/job-trace.json \
+ file:///tmp/job-topology.json \
+ hdfs:///tmp/hadoop-yarn/staging/history/done_intermediate/testuser
+```
 This will analyze all the jobs in
-
-`/home/user/logs/history/done` stored on the
-`local` FileSystem and output the jobtraces in
-`/home/user/job-trace.json` along with topology
-information in `/home/user/topology.output`.
+`/tmp/hadoop-yarn/staging/history/done_intermediate/testuser`
+stored on the `HDFS` FileSystem
+and output the jobtraces in `/tmp/job-trace.json`
+along with topology information in `/tmp/job-topology.json`
+stored on the `local` FileSystem.
 $H3 Folder
-`Command`:
+$H4 Command
- java org.apache.hadoop.tools.rumen.Folder [options] [input] [output]
-
-> Input and output to `Folder` is expected to be a fully
-> qualified FileSystem path. So use file:// to specify
-> files on the `local` FileSystem and hdfs:// to
-> specify files on HDFS.
+```
+java org.apache.hadoop.tools.rumen.Folder [options] [input] [output]
+```
 This command invokes the `Folder` utility of *Rumen*. Folding essentially means that the output duration of the resulting trace is fixed and job timelines are adjusted to respect the final output duration.
-`Options :`
+> Input and output to `Folder` is expected to be a fully
+> qualified FileSystem path. So use `file://` to specify
+> files on the `local` FileSystem and `hdfs://` to
+> specify files on HDFS.
+
+
+$H4 Options
@@ -335,14 +348,28 @@ to respect the final output duration.
 $H4 Examples
 $H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime
-
- java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+
+```
+java org.apache.hadoop.tools.rumen.Folder \
+ -output-duration 1h \
+ -input-cycle 20m \
+ file:///tmp/job-trace.json \
+ file:///tmp/job-trace-1hr.json
+```
 If the folded jobs are out of order then the command will bail out.
 $H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime and tolerate some skewness
- java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m -allow-missorting -skew-buffer-length 100 file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+```
+java org.apache.hadoop.tools.rumen.Folder \
+ -output-duration 1h \
+ -input-cycle 20m \
+ -allow-missorting \
+ -skew-buffer-length 100 \
+ file:///tmp/job-trace.json \
+ file:///tmp/job-trace-1hr.json
+```
 If the folded jobs are out of order, then atmost 100 jobs will be de-skewed.
 If the 101st job is
@@ -350,23 +377,37 @@ If the folded jobs are out of order, then atmost
 $H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime in debug mode
- java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m -debug -temp-directory file:///tmp/debug file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+```
+java org.apache.hadoop.tools.rumen.Folder \
+ -output-duration 1h \
+ -input-cycle 20m \
+ -debug -temp-directory file:///tmp/debug \
+ file:///tmp/job-trace.json \
+ file:///tmp/job-trace-1hr.json
+```
 This will fold the 10hr job-trace file
-`file:///home/user/job-trace.json` to finish within 1hr
+`file:///tmp/job-trace.json` to finish within 1hr
 and use `file:///tmp/debug` as the temporary directory. The intermediate files in the temporary directory will not be cleaned up.
 $H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime with custom concentration.
- java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m -concentration 2 file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json
+```
+java org.apache.hadoop.tools.rumen.Folder \
+ -output-duration 1h \
+ -input-cycle 20m \
+ -concentration 2 \
+ file:///tmp/job-trace.json \
+ file:///tmp/job-trace-1hr.json
+```
 This will fold the 10hr job-trace file
-`file:///home/user/job-trace.json` to finish within 1hr
-with concentration of 2. `Example-2.3.2` will retain 10%
-of the jobs. With *concentration* as 2, 20% of the total input
-jobs will be retained.
+`file:///tmp/job-trace.json` to finish within 1hr
+with concentration of 2.
+If the 10h job-trace is folded to 1h, it retains 10% of the jobs by default.
+With *concentration* as 2, 20% of the total input jobs will be retained.
 Appendix
@@ -377,21 +418,21 @@ $H3 Resources
 MAPREDUCE-751 is the main JIRA that introduced *Rumen* to *MapReduce*.
 Look at the MapReduce
-
-rumen-componentfor further details.
+rumen-component
+for further details.
 $H3 Dependencies
-*Rumen* expects certain library *JARs* to be present in
-the *CLASSPATH*. The required libraries are
+*Rumen* expects certain library *JARs* to be present in the *CLASSPATH*.
+One simple way to run Rumen is to use
+`hadoop jar` command to run it as example below.
-* `Hadoop MapReduce Tools` (`hadoop-mapred-tools-{hadoop-version}.jar`)
-* `Hadoop Common` (`hadoop-common-{hadoop-version}.jar`)
-* `Apache Commons Logging` (`commons-logging-1.1.1.jar`)
-* `Apache Commons CLI` (`commons-cli-1.2.jar`)
-* `Jackson Mapper` (`jackson-mapper-asl-1.4.2.jar`)
-* `Jackson Core` (`jackson-core-asl-1.4.2.jar`)
-
-> One simple way to run Rumen is to use '$HADOOP_HOME/bin/hadoop jar'
-> option to run it.
+```
+$HADOOP_HOME/bin/hadoop jar \
+ $HADOOP_HOME/share/hadoop/tools/lib/hadoop-rumen-2.5.1.jar \
+ org.apache.hadoop.tools.rumen.TraceBuilder \
+ file:///tmp/job-trace.json \
+ file:///tmp/job-topology.json \
+ hdfs:///tmp/hadoop-yarn/staging/history/done_intermediate/testuser
+```
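Review note on the concentration example in the patched `Rumen.md.vm`: the new text says that folding a 10h trace to 1h retains 10% of the jobs by default and 20% with `-concentration 2`. The sketch below reads those two figures as a simple proportional rule, which is an assumption made here for illustration and not something the patch states about how `Folder` is implemented. The helper name `retained_percent` is hypothetical and is not part of Rumen.

```shell
# Hypothetical helper, not part of Rumen.
# Assumption: retained share = (output duration / input duration) * concentration,
# chosen only because it reproduces the 10% and 20% figures quoted in the patch.
retained_percent() {
  input_minutes=$1
  output_minutes=$2
  concentration=${3:-1}   # treat a missing third argument as concentration 1
  # Integer arithmetic; the result is truncated to a whole percent.
  echo $(( 100 * output_minutes * concentration / input_minutes ))
}

retained_percent 600 60      # 10h folded to 1h, default concentration
retained_percent 600 60 2    # same fold with concentration 2
```

With the durations from the documentation's example, the two calls print 10 and 20, matching the percentages given in the added text.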