MAPREDUCE-6150. Update document of Rumen (Masatake Iwasaki via aw)
commit 3f982c5c26
parent ad55083f75

@@ -264,6 +264,8 @@ Release 2.7.0 - UNRELEASED

MAPREDUCE-6141. History server leveldb recovery store (jlowe)

MAPREDUCE-6150. Update document of Rumen (Masatake Iwasaki via aw)

OPTIMIZATIONS

MAPREDUCE-6169. MergeQueue should release reference to the current item

@@ -105,6 +105,7 @@

<item name="Hadoop Streaming" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopStreaming.html"/>
<item name="Hadoop Archives" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html"/>
<item name="DistCp" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistCp.html"/>
<item name="Rumen" href="hadoop-rumen/Rumen.html"/>
</menu>

<menu name="MapReduce REST APIs" inherit="top">

@@ -29,9 +29,7 @@ Rumen

- [Components](#Components)
- [How to use Rumen?](#How_to_use_Rumen)
- [Trace Builder](#Trace_Builder)
- [Folder](#Folder)
- [Appendix](#Appendix)
- [Resources](#Resources)
- [Dependencies](#Dependencies)

@@ -128,18 +126,21 @@ can use the `Folder` utility to fold the current trace to the

desired length. The remaining part of this section explains these
utilities in detail.

Examples in this section assume that certain libraries are present
in the Java CLASSPATH. See [Dependencies](#Dependencies) for more details.

$H3 Trace Builder

$H4 Command

```
java org.apache.hadoop.tools.rumen.TraceBuilder [options] <jobtrace-output> <topology-output> <inputs>
```

This command invokes the `TraceBuilder` utility of *Rumen*.

TraceBuilder converts the JobHistory files into a series of JSON
objects and writes them into the `<jobtrace-output>`
file. It also extracts the cluster layout (topology) and writes it in
the `<topology-output>` file.
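
JobHistory files are the `<inputs>` to `TraceBuilder`. Where they live depends on
your cluster configuration (`mapreduce.jobhistory.done-dir` and
`mapreduce.jobhistory.intermediate-done-dir`); a quick way to check, sketched here
using the sample path from the Example later in this section, is:

```
# List JobHistory files for one user (path borrowed from the example below;
# adjust it to your own history location).
hadoop fs -ls hdfs:///tmp/hadoop-yarn/staging/history/done_intermediate/testuser
```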

@@ -169,7 +170,7 @@ Cluster topology is used as follows :

* To extrapolate splits information for tasks with missing splits
  details or synthetically generated tasks.

$H4 Options

<table>
<tr>

@@ -204,33 +205,45 @@ Cluster topology is used as follows :

$H4 Example

*Rumen* expects certain library *JARs* to be present in the *CLASSPATH*.
One simple way to run Rumen is to use the `$HADOOP_HOME/bin/hadoop jar` command;
see [Dependencies](#Dependencies) for an example of that form.

```
java org.apache.hadoop.tools.rumen.TraceBuilder \
  file:///tmp/job-trace.json \
  file:///tmp/job-topology.json \
  hdfs:///tmp/hadoop-yarn/staging/history/done_intermediate/testuser
```

This will analyze all the jobs in
`/tmp/hadoop-yarn/staging/history/done_intermediate/testuser`
stored on the `HDFS` FileSystem
and output the jobtraces in `/tmp/job-trace.json`
along with topology information in `/tmp/job-topology.json`
stored on the `local` FileSystem.
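
The example above invokes `TraceBuilder` directly with `java`, which requires the
Rumen and Hadoop JARs to already be on the CLASSPATH. A minimal sketch of one way
to set that up, assuming a standard binary-distribution layout (the tools-lib path
is the one referenced under [Dependencies](#Dependencies); exact JAR names vary by
release):

```
# Build a CLASSPATH from the Hadoop distribution, then run TraceBuilder directly.
export CLASSPATH="$($HADOOP_HOME/bin/hadoop classpath):$HADOOP_HOME/share/hadoop/tools/lib/*"
java org.apache.hadoop.tools.rumen.TraceBuilder \
  file:///tmp/job-trace.json \
  file:///tmp/job-topology.json \
  hdfs:///tmp/hadoop-yarn/staging/history/done_intermediate/testuser
```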

$H3 Folder

$H4 Command

```
java org.apache.hadoop.tools.rumen.Folder [options] [input] [output]
```

This command invokes the `Folder` utility of
*Rumen*. Folding essentially means that the output duration of
the resulting trace is fixed and job timelines are adjusted
to respect the final output duration.

> Input and output to `Folder` are expected to be fully
> qualified FileSystem paths. So use `file://` to specify
> files on the `local` FileSystem and `hdfs://` to
> specify files on HDFS.

$H4 Options

<table>
<tr>

@@ -336,13 +349,27 @@ $H4 Examples

$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime

```
java org.apache.hadoop.tools.rumen.Folder \
  -output-duration 1h \
  -input-cycle 20m \
  file:///tmp/job-trace.json \
  file:///tmp/job-trace-1hr.json
```

If the folded jobs are out of order then the command will bail out.

$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime and tolerate some skewness

```
java org.apache.hadoop.tools.rumen.Folder \
  -output-duration 1h \
  -input-cycle 20m \
  -allow-missorting \
  -skew-buffer-length 100 \
  file:///tmp/job-trace.json \
  file:///tmp/job-trace-1hr.json
```

If the folded jobs are out of order, then at most
100 jobs will be de-skewed. If the 101<sup>st</sup> job is

@@ -350,23 +377,37 @@ If the folded jobs are out of order, then at most

$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime in debug mode

```
java org.apache.hadoop.tools.rumen.Folder \
  -output-duration 1h \
  -input-cycle 20m \
  -debug -temp-directory file:///tmp/debug \
  file:///tmp/job-trace.json \
  file:///tmp/job-trace-1hr.json
```

This will fold the 10hr job-trace file
`file:///tmp/job-trace.json` to finish within 1hr
and use `file:///tmp/debug` as the temporary directory.
The intermediate files in the temporary directory will not be cleaned
up.
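
Since the intermediate files are left behind, you may want to remove the temporary
directory yourself once you are done inspecting it. A minimal sketch, using the
`file:///tmp/debug` path from the example above:

```
# Delete the Folder debug/temp directory when it is no longer needed.
# It is on the local FileSystem here, so a plain `rm -rf /tmp/debug` also works.
hadoop fs -rm -r file:///tmp/debug
```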

$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime with custom concentration

```
java org.apache.hadoop.tools.rumen.Folder \
  -output-duration 1h \
  -input-cycle 20m \
  -concentration 2 \
  file:///tmp/job-trace.json \
  file:///tmp/job-trace-1hr.json
```

This will fold the 10hr job-trace file
`file:///tmp/job-trace.json` to finish within 1hr
with a concentration of 2.
If the 10h job-trace is folded to 1h, it retains 10% of the jobs by default.
With a *concentration* of 2, 20% of the total input jobs will be retained.
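
As a rough rule of thumb for the retention behaviour described above (this simply
restates the text; it is not an exact formula from the `Folder` implementation):

```
# Approximate fraction of input jobs retained by Folder:
#   retained ~= (output-duration / input-duration) * concentration
#
#   10h folded to 1h, default concentration:   1/10       = 10%
#   10h folded to 1h with -concentration 2:    (1/10) * 2 = 20%
```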

Appendix

@@ -377,21 +418,21 @@ $H3 Resources

<a href="https://issues.apache.org/jira/browse/MAPREDUCE-751">MAPREDUCE-751</a>
is the main JIRA that introduced *Rumen* to *MapReduce*.
Look at the MapReduce
<a href="https://issues.apache.org/jira/browse/MAPREDUCE/component/12313617">rumen-component</a>
for further details.

$H3 Dependencies

*Rumen* expects certain library *JARs* to be present in the *CLASSPATH*.
One simple way to run Rumen is to use the
`hadoop jar` command, as in the example below.

```
$HADOOP_HOME/bin/hadoop jar \
  $HADOOP_HOME/share/hadoop/tools/lib/hadoop-rumen-2.5.1.jar \
  org.apache.hadoop.tools.rumen.TraceBuilder \
  file:///tmp/job-trace.json \
  file:///tmp/job-topology.json \
  hdfs:///tmp/hadoop-yarn/staging/history/done_intermediate/testuser
```
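
The example above hardcodes `hadoop-rumen-2.5.1.jar`; the version suffix differs on
other releases. Assuming the standard binary-distribution layout, the bundled Rumen
JAR can be located with:

```
# Find the Rumen JAR shipped with this Hadoop distribution.
ls $HADOOP_HOME/share/hadoop/tools/lib/hadoop-rumen-*.jar
```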