diff --git a/hadoop-common-project/hadoop-common/CHANGES.txt b/hadoop-common-project/hadoop-common/CHANGES.txt index a30288fb369..8c6cf164d71 100644 --- a/hadoop-common-project/hadoop-common/CHANGES.txt +++ b/hadoop-common-project/hadoop-common/CHANGES.txt @@ -212,6 +212,9 @@ Release 2.6.0 - UNRELEASED HADOOP-8808. Update FsShell documentation to mention deprecation of some of the commands, and mention alternatives (Akira AJISAKA via aw) + HADOOP-10954. Adding site documents of hadoop-tools (Masatake Iwasaki + via aw) + OPTIMIZATIONS HADOOP-10838. Byte array native checksumming. (James Thomas via todd) diff --git a/hadoop-tools/hadoop-gridmix/src/site/markdown/GridMix.md.vm b/hadoop-tools/hadoop-gridmix/src/site/markdown/GridMix.md.vm new file mode 100644 index 00000000000..53c88915b79 --- /dev/null +++ b/hadoop-tools/hadoop-gridmix/src/site/markdown/GridMix.md.vm @@ -0,0 +1,818 @@ + + +Gridmix +======= + +--- + +- [Overview](#Overview) +- [Usage](#Usage) +- [General Configuration Parameters](#General_Configuration_Parameters) +- [Job Types](#Job_Types) +- [Job Submission Policies](#Job_Submission_Policies) +- [Emulating Users and Queues](#Emulating_Users_and_Queues) +- [Emulating Distributed Cache Load](#Emulating_Distributed_Cache_Load) +- [Configuration of Simulated Jobs](#Configuration_of_Simulated_Jobs) +- [Emulating Compression/Decompression](#Emulating_CompressionDecompression) +- [Emulating High-Ram jobs](#Emulating_High-Ram_jobs) +- [Emulating resource usages](#Emulating_resource_usages) +- [Simplifying Assumptions](#Simplifying_Assumptions) +- [Appendix](#Appendix) + +--- + +Overview +-------- + +GridMix is a benchmark for Hadoop clusters. It submits a mix of +synthetic jobs, modeling a profile mined from production loads. + +There exist three versions of the GridMix tool. This document +discusses the third (checked into `src/contrib` ), distinct +from the two checked into the `src/benchmarks` sub-directory. +While the first two versions of the tool included stripped-down versions +of common jobs, both were principally saturation tools for stressing the +framework at scale. In support of a broader range of deployments and +finer-tuned job mixes, this version of the tool will attempt to model +the resource profiles of production jobs to identify bottlenecks, guide +development, and serve as a replacement for the existing GridMix +benchmarks. + +To run GridMix, you need a MapReduce job trace describing the job mix +for a given cluster. Such traces are typically generated by Rumen (see +Rumen documentation). GridMix also requires input data from which the +synthetic jobs will be reading bytes. The input data need not be in any +particular format, as the synthetic jobs are currently binary readers. +If you are running on a new cluster, an optional step generating input +data may precede the run. +In order to emulate the load of production jobs from a given cluster +on the same or another cluster, follow these steps: + +1. Locate the job history files on the production cluster. This + location is specified by the + `mapred.job.tracker.history.completed.location` + configuration property of the cluster. + +2. Run Rumen to build a job trace in JSON format for all or select jobs. + +3. Use GridMix with the job trace on the benchmark cluster. + +Jobs submitted by GridMix have names of the form +"`GRIDMIXnnnnnn`", where +"`nnnnnn`" is a sequence number padded with leading zeroes. 
Usage
-----

Basic command-line usage without configuration parameters:

    org.apache.hadoop.mapred.gridmix.Gridmix [-generate <size>] [-users <users-list>] <iopath> <trace>

Basic command-line usage with configuration parameters:

    org.apache.hadoop.mapred.gridmix.Gridmix \
      -Dgridmix.client.submit.threads=10 -Dgridmix.output.directory=foo \
      [-generate <size>] [-users <users-list>] <iopath> <trace>

> Configuration parameters like
> `-Dgridmix.client.submit.threads=10` and
> `-Dgridmix.output.directory=foo` as given above should
> be used *before* other GridMix parameters.

The `<iopath>` parameter is the working directory for
GridMix. Note that this can either be on the local file-system
or on HDFS, but it is highly recommended that it be the same as that for
the original job mix so that GridMix puts the same load on the local
file-system and HDFS respectively.

The `-generate` option is used to generate input data and
Distributed Cache files for the synthetic jobs. It accepts standard units
of size suffixes, e.g. `100g` will generate
100 * 2^30 bytes as input data.
`<iopath>/input` is the destination directory for
generated input data and/or the directory from which input data will be
read. HDFS-based Distributed Cache files are generated under the
distributed cache directory `<iopath>/distributedCache`.
If some of the needed Distributed Cache files already exist in the
distributed cache directory, then only the remaining non-existing
Distributed Cache files are generated when the `-generate` option
is specified.

The `-users` option is used to point to a users-list
file (see Emulating Users and Queues).

The `<trace>` parameter is a path to a job trace
generated by Rumen. This trace can be compressed (it must be readable
using one of the compression codecs supported by the cluster) or
uncompressed. Use "-" as the value of this parameter if you
want to pass an *uncompressed* trace via the standard
input-stream of GridMix.

The class `org.apache.hadoop.mapred.gridmix.Gridmix` can
be found in the JAR
`contrib/gridmix/hadoop-gridmix-$VERSION.jar` inside your
Hadoop installation, where `$VERSION` corresponds to the
version of Hadoop installed. A simple way of ensuring that this class
and all its dependencies are loaded correctly is to use the
`hadoop` wrapper script in Hadoop:

    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
      [-generate <size>] [-users <users-list>] <iopath> <trace>

The supported configuration parameters are explained in the
following sections.
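For illustration only, a complete invocation that first generates 100 GiB of
input data and then submits the jobs from a Rumen trace might look like the
following sketch; the JAR location, HDFS directory, users file and trace file
name are hypothetical and should be replaced with values from your
installation:

    hadoop jar contrib/gridmix/hadoop-gridmix-$VERSION.jar \
      org.apache.hadoop.mapred.gridmix.Gridmix \
      -Dgridmix.output.directory=gridmix-results \
      -Dgridmix.client.submit.threads=10 \
      -generate 100g -users users.txt \
      hdfs:///user/bench/gridmix-io file:///home/bench/job-trace.json

Here the `-generate 100g` step writes roughly 100 GiB of synthetic input under
`hdfs:///user/bench/gridmix-io/input` before the jobs described in
`job-trace.json` are submitted.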
General Configuration Parameters
--------------------------------

| Parameter | Description |
| --------- | ----------- |
| `gridmix.output.directory` | The directory into which output will be written. If specified, `iopath` will be relative to this parameter. The submitting user must have read/write access to this directory. The user should also be mindful of any quota issues that may arise during a run. The default is "gridmix". |
| `gridmix.client.submit.threads` | The number of threads submitting jobs to the cluster. This also controls how many splits will be loaded into memory at a given time, pending the submit time in the trace. Splits are pre-generated to hit submission deadlines, so particularly dense traces may want more submitting threads. However, storing splits in memory is reasonably expensive, so you should raise this cautiously. The default is 1 for the SERIAL job-submission policy (see Job Submission Policies) and one more than the number of processors on the client machine for the other policies. |
| `gridmix.submit.multiplier` | The multiplier to accelerate or decelerate the submission of jobs. The time separating two jobs is multiplied by this factor. The default value is 1.0. This is a crude mechanism to size a job trace to a cluster. |
| `gridmix.client.pending.queue.depth` | The depth of the queue of job descriptions awaiting split generation. The jobs read from the trace occupy a queue of this depth before being processed by the submission threads. It is unusual to configure this. The default is 5. |
| `gridmix.gen.blocksize` | The block-size of generated data. The default value is 256 MiB. |
| `gridmix.gen.bytes.per.file` | The maximum bytes written per file. The default value is 1 GiB. |
| `gridmix.min.file.size` | The minimum size of the input files. The default limit is 128 MiB. Tweak this parameter if you see an error-message like "Found no satisfactory file" while testing GridMix with a relatively-small input data-set. |
| `gridmix.max.total.scan` | The maximum size of the input files. The default limit is 100 TiB. |
| `gridmix.task.jvm-options.enable` | Enables Gridmix to configure the simulated task's max heap options using the values obtained from the original task (i.e. via trace). |
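For example, to submit the trace's jobs at twice the recorded pace while
leaving everything else at its default, one could (hypothetically) halve the
inter-job gaps; `<gridmix-jar>`, `<iopath>` and `<trace>` are the placeholders
introduced in the Usage section:

    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
      -Dgridmix.submit.multiplier=0.5 \
      <iopath> <trace>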
Job Types
---------

GridMix takes as input a job trace, essentially a stream of
JSON-encoded job descriptions. For each job description, the submission
client obtains the original job submission time and, for each task in
that job, the byte and record counts read and written. Given this data,
it constructs a synthetic job with the same byte and record patterns as
recorded in the trace. It constructs jobs of two types:
| Job Type | Description |
| -------- | ----------- |
| LOADJOB | A synthetic job that emulates the workload recorded in the Rumen trace. The current version emulates only I/O: it reproduces the I/O workload on the benchmark cluster by embedding the detailed I/O information for every map and reduce task, such as the number of bytes and records read and written, into each job's input splits. The map tasks further relay the I/O patterns of reduce tasks through the intermediate map output data. |
| SLEEPJOB | A synthetic job where each task does *nothing* but sleep for a certain duration as observed in the production trace. The scalability of the Job Tracker is often limited by how many heartbeats it can handle every second. (Heartbeats are periodic messages sent from Task Trackers to update their status and grab new tasks from the Job Tracker.) Since a benchmark cluster is typically a fraction of the size of a production cluster, the heartbeat traffic generated by the slave nodes is well below the level of the production cluster. One possible solution is to run multiple Task Trackers on each slave node, but that leads to the obvious problem that the I/O workload generated by the synthetic jobs would thrash the slave nodes. Hence the need for such a job. |
The following configuration parameters affect the job type:
| Parameter | Description |
| --------- | ----------- |
| `gridmix.job.type` | The value for this key can be one of LOADJOB or SLEEPJOB. The default value is LOADJOB. |
| `gridmix.key.fraction` | For a LOADJOB type of job, the fraction of a record used for the data for the key. The default value is 0.1. |
| `gridmix.sleep.maptask-only` | For a SLEEPJOB type of job, whether to ignore the reduce tasks for the job. The default is false. |
| `gridmix.sleep.fake-locations` | For a SLEEPJOB type of job, the number of fake locations for map tasks for the job. The default is 0. |
| `gridmix.sleep.max-map-time` | For a SLEEPJOB type of job, the maximum runtime for map tasks for the job in milliseconds. The default is unlimited. |
| `gridmix.sleep.max-reduce-time` | For a SLEEPJOB type of job, the maximum runtime for reduce tasks for the job in milliseconds. The default is unlimited. |
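As a hypothetical example, to stress the Job Tracker with sleep jobs that run
map tasks only and cap each map at one minute, a run might pass the following
options (placeholders as in the Usage section):

    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
      -Dgridmix.job.type=SLEEPJOB \
      -Dgridmix.sleep.maptask-only=true \
      -Dgridmix.sleep.max-map-time=60000 \
      <iopath> <trace>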
Job Submission Policies
-----------------------

GridMix controls the rate of job submission. This control can be
based on the trace information or can be based on statistics it gathers
from the Job Tracker. Based on the submission policy users define,
GridMix uses the respective algorithm to control the job submission.
There are currently three types of policies:
| Job Submission Policy | Description |
| --------------------- | ----------- |
| STRESS | Keep submitting jobs so that the cluster remains under stress. In this mode we control the rate of job submission by monitoring the real-time load of the cluster so that we can maintain a stable stress level of workload on the cluster. Based on the statistics we gather, we decide whether a cluster is *underloaded* or *overloaded*. We consider a cluster *underloaded* if and only if all three of the following conditions hold: (1) the number of pending and running jobs is under a threshold TJ; (2) the number of pending and running maps is under a threshold TM; (3) the number of pending and running reduces is under a threshold TR. The thresholds TJ, TM and TR are proportional to the size of the cluster and to the map and reduce slot capacities, respectively. If the cluster is *overloaded*, job submission is throttled. In the actual calculation we also weigh each running task by its remaining work - namely, a 90% complete task is only counted as 0.1 in the calculation. Finally, to avoid a very large job blocking other jobs, we limit the number of pending/waiting tasks each job can contribute. |
| REPLAY | In this mode we replay the job traces faithfully. This mode exactly follows the time-intervals given in the actual job trace. |
| SERIAL | In this mode we submit the next job only once the job submitted earlier is completed. |
The following configuration parameters affect the job submission policy:
| Parameter | Description |
| --------- | ----------- |
| `gridmix.job-submission.policy` | The value for this key can be one of STRESS, REPLAY or SERIAL. In most cases the value would be STRESS or REPLAY. The default value is STRESS. |
| `gridmix.throttle.jobs-to-tracker-ratio` | In STRESS mode, the minimum ratio of running jobs to Task Trackers in a cluster for the cluster to be considered *overloaded*. This is the threshold TJ referred to earlier. The default is 1.0. |
| `gridmix.throttle.maps.task-to-slot-ratio` | In STRESS mode, the minimum ratio of pending and running map tasks (i.e. incomplete map tasks) to the number of map slots for a cluster for the cluster to be considered *overloaded*. This is the threshold TM referred to earlier. Running map tasks are counted partially. For example, a 40% complete map task is counted as 0.6 map tasks. The default is 2.0. |
| `gridmix.throttle.reduces.task-to-slot-ratio` | In STRESS mode, the minimum ratio of pending and running reduce tasks (i.e. incomplete reduce tasks) to the number of reduce slots for a cluster for the cluster to be considered *overloaded*. This is the threshold TR referred to earlier. Running reduce tasks are counted partially. For example, a 30% complete reduce task is counted as 0.7 reduce tasks. The default is 2.5. |
| `gridmix.throttle.maps.max-slot-share-per-job` | In STRESS mode, the maximum share of a cluster's map-slots capacity that can be counted toward a job's incomplete map tasks in the overload calculation. The default is 0.1. |
| `gridmix.throttle.reducess.max-slot-share-per-job` | In STRESS mode, the maximum share of a cluster's reduce-slots capacity that can be counted toward a job's incomplete reduce tasks in the overload calculation. The default is 0.1. |
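For instance, to replay a trace with the recorded gaps between job
submissions scaled down to one quarter, one could (hypothetically) combine
the REPLAY policy with the submit multiplier described earlier (placeholders
as in the Usage section):

    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
      -Dgridmix.job-submission.policy=REPLAY \
      -Dgridmix.submit.multiplier=0.25 \
      <iopath> <trace>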
Emulating Users and Queues
--------------------------

Typical production clusters are often shared by different users, and
the cluster capacity is divided among different departments through job
queues. Ensuring fairness among jobs from all users, honoring queue
capacity allocation policies and preventing an ill-behaved job from
taking over the cluster add significant complexity to the Hadoop
software. To be able to sufficiently test and discover bugs in these
areas, GridMix must emulate the contentions of jobs from different users
and/or submitted to different queues.

Emulating multiple queues is easy - we simply set up the benchmark
cluster with the same queue configuration as the production cluster and
we configure synthetic jobs so that they get submitted to the same queue
as recorded in the trace. However, not all users shown in the trace have
accounts on the benchmark cluster. Instead, we set up a number of testing
user accounts and associate each unique user in the trace with testing
users in a round-robin fashion.

The following configuration parameters affect the emulation of users
and queues:
| Parameter | Description |
| --------- | ----------- |
| `gridmix.job-submission.use-queue-in-trace` | When set to true, uses exactly the same set of queues as those mentioned in the trace. The default value is false. |
| `gridmix.job-submission.default-queue` | Specifies the default queue to which all the jobs would be submitted. If this parameter is not specified, GridMix uses the default queue defined for the submitting user on the cluster. |
| `gridmix.user.resolve.class` | Specifies which UserResolver implementation to use. We currently have three implementations: (1) `org.apache.hadoop.mapred.gridmix.EchoUserResolver` - submits a job as the user who submitted the original job. All the users of the production cluster identified in the job trace must also have accounts on the benchmark cluster in this case. (2) `org.apache.hadoop.mapred.gridmix.SubmitterUserResolver` - submits all the jobs as the current GridMix user. In this case we simply map all the users in the trace to the current GridMix user and submit the job. (3) `org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver` - maps trace users to test users in a round-robin fashion. In this case we set up a number of testing user accounts and associate each unique user in the trace with testing users in a round-robin fashion. The default is `org.apache.hadoop.mapred.gridmix.SubmitterUserResolver`. |
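As a hypothetical example, to make the simulated jobs target the same queues
that appear in the trace rather than the submitter's default queue, one could
pass (placeholders as in the Usage section):

    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
      -Dgridmix.job-submission.use-queue-in-trace=true \
      <iopath> <trace>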
If the parameter `gridmix.user.resolve.class` is set to
`org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver`,
we need to define a users-list file with a list of test users.
This is specified using the `-users` option to GridMix.

Specifying a users-list file using the `-users` option is
mandatory when using the round-robin user-resolver. Other user-resolvers
ignore this option.

A users-list file has one user per line, each line of the format:

    <username>

For example:

    user1
    user2
    user3

In the above example we have defined three users `user1`, `user2` and `user3`.
Now we would associate each unique user in the trace to the above users
defined in round-robin fashion. For example, if the trace's users are
`tuser1`, `tuser2`, `tuser3`, `tuser4` and `tuser5`, then the mappings would be:

    tuser1 -> user1
    tuser2 -> user2
    tuser3 -> user3
    tuser4 -> user1
    tuser5 -> user2

For backward compatibility reasons, each line of the users-list file can
contain a username followed by groupnames in the form username[,group]*.
The groupnames will be ignored by Gridmix.
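A hypothetical run using the round-robin resolver with the users-list above
could then look like the following; the users file path is illustrative and
the other placeholders are as in the Usage section:

    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
      -Dgridmix.user.resolve.class=org.apache.hadoop.mapred.gridmix.RoundRobinUserResolver \
      -users file:///home/bench/users.txt \
      <iopath> <trace>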
Emulating Distributed Cache Load
--------------------------------

Gridmix emulates Distributed Cache load by default for LOADJOB type of
jobs. This is done by precreating the needed Distributed Cache files for all
the simulated jobs as part of a separate MapReduce job.

Emulation of Distributed Cache load in gridmix simulated jobs can be
disabled by configuring the property
`gridmix.distributed-cache-emulation.enable` to
`false`.
But generation of Distributed Cache data by gridmix is driven by the
`-generate` option and is independent of this configuration
property.

Both generation of Distributed Cache files and emulation of
Distributed Cache load are disabled if:

* the input trace comes from the standard input-stream instead of a file, or
* the `<iopath>` specified is on the local file-system, or
* any of the ancestor directories of the distributed cache directory
  i.e. `<iopath>/distributedCache` (including the distributed
  cache directory itself) doesn't have execute permission for others.


Configuration of Simulated Jobs
-------------------------------

Gridmix3 sets some configuration properties in the simulated jobs
submitted by it so that they can be mapped back to the corresponding job
in the input job trace. These configuration parameters include:

| Parameter | Description |
| --------- | ----------- |
| `gridmix.job.original-job-id` | The job id of the original cluster's job corresponding to this simulated job. |
| `gridmix.job.original-job-name` | The job name of the original cluster's job corresponding to this simulated job. |
Emulating Compression/Decompression
-----------------------------------

MapReduce supports data compression and decompression.
Input to a MapReduce job can be compressed. Similarly, output of Map
and Reduce tasks can also be compressed. Compression/Decompression
emulation in GridMix is important because emulating
compression/decompression will affect the CPU and memory usage of the
task. A task emulating compression/decompression will affect other
tasks and daemons running on the same node.

Compression emulation is enabled if
`gridmix.compression-emulation.enable` is set to
`true`. By default compression emulation is enabled for
jobs of type *LOADJOB*. With compression emulation enabled,
GridMix will generate compressed text data with a constant
compression ratio. Hence a simulated GridMix job will emulate
compression/decompression using compressible text data (having a
constant compression ratio), irrespective of the compression ratio
observed in the actual job.

A typical MapReduce Job deals with data compression/decompression in
the following phases:

* `Job input data decompression: ` GridMix generates
  compressible input data when compression emulation is enabled.
  Based on the original job's configuration, a simulated GridMix job
  will use a decompressor to read the compressed input data.
  Currently, GridMix uses
  `mapreduce.input.fileinputformat.inputdir` to determine
  if the original job used compressed input data or
  not. If the original job's input files are uncompressed then the
  simulated job will read the compressed input file without using a
  decompressor.

* `Intermediate data compression and decompression: `
  If the original job has map output compression enabled then GridMix
  too will enable map output compression for the simulated job.
  Accordingly, the reducers will use a decompressor to read the map
  output data.

* `Job output data compression: `
  If the original job's output is compressed then GridMix
  too will enable job output compression for the simulated job.

The following configuration parameters affect compression emulation:
| Parameter | Description |
| --------- | ----------- |
| `gridmix.compression-emulation.enable` | Enables compression emulation in simulated GridMix jobs. Default is true. |

With compression emulation turned on, GridMix will generate compressed
input data. Hence the total size of the input
data will be less than the expected size. Set
`gridmix.min.file.size` to a smaller value (roughly 10% of
`gridmix.gen.bytes.per.file`) to enable GridMix to
correctly emulate compression.
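If compression emulation skews results for a particular workload, it can be
(hypothetically) switched off for a run while everything else stays unchanged
(placeholders as in the Usage section):

    hadoop jar <gridmix-jar> org.apache.hadoop.mapred.gridmix.Gridmix \
      -Dgridmix.compression-emulation.enable=false \
      <iopath> <trace>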
Emulating High-Ram jobs
-----------------------

MapReduce allows users to define a job as a High-Ram job. Tasks from a
High-Ram job can occupy multiple slots on the task-trackers. The
task-tracker assigns a fixed amount of virtual memory to each slot, so
tasks from High-Ram jobs, by occupying multiple slots, can use up more
virtual memory than a default task.

Emulating this behavior is important for the following reasons:

* Impact on the scheduler: Scheduling of tasks from High-Ram jobs
  affects the scheduling behavior as it might result in slot
  reservation and affect slot/resource utilization.

* Impact on the node: Since High-Ram tasks occupy multiple slots,
  trackers do some bookkeeping for allocating extra resources for
  these tasks. Thus this becomes a precursor for memory emulation,
  where tasks with high memory requirements need to be considered
  as High-Ram tasks.

High-Ram feature emulation can be disabled by setting
`gridmix.highram-emulation.enable` to `false`.


Emulating resource usages
-------------------------

Usages of resources like CPU, physical memory, virtual memory, JVM heap
etc. are recorded by MapReduce using its task counters. This information
is used by GridMix to emulate the resource usages in the simulated
tasks. Emulating resource usages helps GridMix exert a load on the
test cluster similar to that seen on the actual cluster.

MapReduce tasks use up resources during their entire lifetime. GridMix
also tries to mimic this behavior by spanning resource usage emulation
across the entire lifetime of the simulated task. Each resource to be
emulated should have an *emulator* associated with it.
Each such *emulator* should implement the
`org.apache.hadoop.mapred.gridmix.emulators.resourceusage.ResourceUsageEmulatorPlugin`
interface. Resource *emulators* in GridMix are *plugins* that can be
configured (plugged in or out) before every run. GridMix users can
configure multiple emulator *plugins* by passing a comma
separated list of *emulators* as the value of the
`gridmix.emulators.resource-usage.plugins` parameter.

List of *emulators* shipped with GridMix:

* Cumulative CPU usage *emulator* :
  GridMix uses the cumulative CPU usage value published by Rumen
  and makes sure that the total cumulative CPU usage of the simulated
  task is close to the value published by Rumen. GridMix can be
  configured to emulate cumulative CPU usage by adding
  `org.apache.hadoop.mapred.gridmix.emulators.resourceusage.CumulativeCpuUsageEmulatorPlugin`
  to the list of emulator *plugins* configured for the
  `gridmix.emulators.resource-usage.plugins` parameter.
  The CPU usage emulator is designed in such a way that
  it only emulates at specific progress boundaries of the task. This
  interval can be configured using
  `gridmix.emulators.resource-usage.cpu.emulation-interval`.
  The default value for this parameter is `0.1` i.e.
  `10%`.

* Total heap usage *emulator* :
  GridMix uses the total heap usage value published by Rumen
  and makes sure that the total heap usage of the simulated
  task is close to the value published by Rumen.
GridMix can be + configured to emulate total heap usage by adding + `org.apache.hadoop.mapred.gridmix.emulators.resourceusage + .TotalHeapUsageEmulatorPlugin` to the list of emulator + *plugins* configured for the + `gridmix.emulators.resource-usage.plugins` parameter. + Heap usage emulator is designed in such a way that + it only emulates at specific progress boundaries of the task. This + interval can be configured using + `gridmix.emulators.resource-usage.heap.emulation-interval + `. The default value for this parameter is `0.1` + i.e `10%` progress interval. + +Note that GridMix will emulate resource usages only for jobs of type *LOADJOB* . + + +Simplifying Assumptions +----------------------- + +GridMix will be developed in stages, incorporating feedback and +patches from the community. Currently its intent is to evaluate +MapReduce and HDFS performance and not the layers on top of them (i.e. +the extensive lib and sub-project space). Given these two limitations, +the following characteristics of job load are not currently captured in +job traces and cannot be accurately reproduced in GridMix: + +* *Filesystem Properties* - No attempt is made to match block + sizes, namespace hierarchies, or any property of input, intermediate + or output data other than the bytes/records consumed and emitted from + a given task. This implies that some of the most heavily-used parts of + the system - text processing, streaming, etc. - cannot be meaningfully tested + with the current implementation. + +* *I/O Rates* - The rate at which records are + consumed/emitted is assumed to be limited only by the speed of the + reader/writer and constant throughout the task. + +* *Memory Profile* - No data on tasks' memory usage over time + is available, though the max heap-size is retained. + +* *Skew* - The records consumed and emitted to/from a given + task are assumed to follow observed averages, i.e. records will be + more regular than may be seen in the wild. Each map also generates + a proportional percentage of data for each reduce, so a job with + unbalanced input will be flattened. + +* *Job Failure* - User code is assumed to be correct. + +* *Job Independence* - The output or outcome of one job does + not affect when or whether a subsequent job will run. + + +Appendix +-------- + +Issues tracking the original implementations of +GridMix1, +GridMix2, +and GridMix3 +can be found on the Apache Hadoop MapReduce JIRA. Other issues tracking +the current development of GridMix can be found by searching + +the Apache Hadoop MapReduce JIRA diff --git a/hadoop-tools/hadoop-rumen/src/site/markdown/Rumen.md.vm b/hadoop-tools/hadoop-rumen/src/site/markdown/Rumen.md.vm new file mode 100644 index 00000000000..e25f3a794ae --- /dev/null +++ b/hadoop-tools/hadoop-rumen/src/site/markdown/Rumen.md.vm @@ -0,0 +1,397 @@ + + +#set ( $H3 = '###' ) +#set ( $H4 = '####' ) +#set ( $H5 = '#####' ) + +Rumen +===== + +--- + +- [Overview](#Overview) + - [Motivation](#Motivation) + - [Components](#Components) +- [How to use Rumen?](#How_to_use_Rumen) + - [Trace Builder](#Trace_Builder) + - [Example](#Example) + - [Folder](#Folder) + - [Examples](#Examples) +- [Appendix](#Appendix) + - [Resources](#Resources) + - [Dependencies](#Dependencies) + +--- + +Overview +-------- + +*Rumen* is a data extraction and analysis tool built for +*Apache Hadoop*. *Rumen* mines *JobHistory* logs to +extract meaningful data and stores it in an easily-parsed, condensed +format or *digest*. 
The raw trace data from MapReduce logs are +often insufficient for simulation, emulation, and benchmarking, as these +tools often attempt to measure conditions that did not occur in the +source data. For example, if a task ran locally in the raw trace data +but a simulation of the scheduler elects to run that task on a remote +rack, the simulator requires a runtime its input cannot provide. +To fill in these gaps, Rumen performs a statistical analysis of the +digest to estimate the variables the trace doesn't supply. Rumen traces +drive both Gridmix (a benchmark of Hadoop MapReduce clusters) and Mumak +(a simulator for the JobTracker). + + +$H3 Motivation + +* Extracting meaningful data from *JobHistory* logs is a common + task for any tool built to work on *MapReduce*. It + is tedious to write a custom tool which is so tightly coupled with + the *MapReduce* framework. Hence there is a need for a + built-in tool for performing framework level task of log parsing and + analysis. Such a tool would insulate external systems depending on + job history against the changes made to the job history format. + +* Performing statistical analysis of various attributes of a + *MapReduce Job* such as *task runtimes, task failures + etc* is another common task that the benchmarking + and simulation tools might need. *Rumen* generates + + *Cumulative Distribution Functions (CDF)* + for the Map/Reduce task runtimes. + Runtime CDF can be used for extrapolating the task runtime of + incomplete, missing and synthetic tasks. Similarly CDF is also + computed for the total number of successful tasks for every attempt. + + +$H3 Components + +*Rumen* consists of 2 components + +* *Trace Builder* : + Converts *JobHistory* logs into an easily-parsed format. + Currently `TraceBuilder` outputs the trace in + *JSON* + format. + +* *Folder *: + A utility to scale the input trace. A trace obtained from + *TraceBuilder* simply summarizes the jobs in the + input folders and files. The time-span within which all the jobs in + a given trace finish can be considered as the trace runtime. + *Folder* can be used to scale the runtime of a trace. + Decreasing the trace runtime might involve dropping some jobs from + the input trace and scaling down the runtime of remaining jobs. + Increasing the trace runtime might involve adding some dummy jobs to + the resulting trace and scaling up the runtime of individual jobs. + + +How to use Rumen? +----------------- + +Converting *JobHistory* logs into a desired job-trace consists of 2 steps + +1. Extracting information into an intermediate format + +2. Adjusting the job-trace obtained from the intermediate trace to + have the desired properties. + +> Extracting information from *JobHistory* logs is a one time +> operation. This so called *Gold Trace* can be reused to +> generate traces with desired values of properties such as +> `output-duration`, `concentration` etc. + +*Rumen* provides 2 basic commands + +* `TraceBuilder` +* `Folder` + +Firstly, we need to generate the *Gold Trace*. Hence the first +step is to run `TraceBuilder` on a job-history folder. +The output of the `TraceBuilder` is a job-trace file (and an +optional cluster-topology file). In case we want to scale the output, we +can use the `Folder` utility to fold the current trace to the +desired length. The remaining part of this section explains these +utilities in detail. + +> Examples in this section assumes that certain libraries are present +> in the java CLASSPATH. See Section-3.2 for more details. 
$H3 Trace Builder

`Command:`

    java org.apache.hadoop.tools.rumen.TraceBuilder [options] <jobtrace-output> <topology-output> <inputs>

This command invokes the `TraceBuilder` utility of
*Rumen*. It converts the JobHistory files into a series of JSON
objects and writes them into the `<jobtrace-output>`
file. It also extracts the cluster layout (topology) and writes it in
the `<topology-output>` file.
`<inputs>` represents a space-separated list of
JobHistory files and folders.

> 1) Input and output to `TraceBuilder` is expected to
> be a fully qualified FileSystem path. So use file://
> to specify files on the `local` FileSystem and
> hdfs:// to specify files on HDFS. Since input files or
> folders are FileSystem paths, they can be globbed.
> This can be useful while specifying multiple file paths using
> regular expressions.

> 2) By default, TraceBuilder does not recursively scan the input
> folder for job history files. Only the files that are directly
> placed under the input folder will be considered for generating
> the trace. To add all the files under the input directory by
> recursively scanning the input directory, use the `-recursive`
> option.

Cluster topology is used as follows:

* To reconstruct the splits and make sure that the
  distances/latencies seen in the actual run are modeled correctly.

* To extrapolate splits information for tasks with missing splits
  details or synthetically generated tasks.

`Options :`
| Parameter | Description | Notes |
| --------- | ----------- | ----- |
| `-demuxer` | Used to read the jobhistory files. The default is `DefaultInputDemuxer`. | Demuxer decides how the input file maps to jobhistory file(s). Job history logs and job configuration files are typically small files, and can be more effectively stored when embedded in some container file format like SequenceFile or TFile. To support such usage cases, one can specify a customized Demuxer class that can extract individual job history logs and job configuration files from the source files. |
| `-recursive` | Recursively traverse input paths for job history logs. | This option should be used to inform the TraceBuilder to recursively scan the input paths and process all the files under them. Note that, by default, only the history logs that are directly under the input folder are considered for generating the trace. |
+ + +$H4 Example + + java org.apache.hadoop.tools.rumen.TraceBuilder file:///home/user/job-trace.json file:///home/user/topology.output file:///home/user/logs/history/done + +This will analyze all the jobs in + +`/home/user/logs/history/done` stored on the +`local` FileSystem and output the jobtraces in +`/home/user/job-trace.json` along with topology +information in `/home/user/topology.output`. + + +$H3 Folder + +`Command`: + + java org.apache.hadoop.tools.rumen.Folder [options] [input] [output] + +> Input and output to `Folder` is expected to be a fully +> qualified FileSystem path. So use file:// to specify +> files on the `local` FileSystem and hdfs:// to +> specify files on HDFS. + +This command invokes the `Folder` utility of +*Rumen*. Folding essentially means that the output duration of +the resulting trace is fixed and job timelines are adjusted +to respect the final output duration. + +`Options :` + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| Parameter | Description | Notes |
| --------- | ----------- | ----- |
| `-input-cycle` | Defines the basic unit of time for the folding operation. There is no default value for input-cycle. Input cycle must be provided. | `-input-cycle 10m` implies that the whole trace run will be sliced at a 10min interval. Basic operations will be done on the 10m chunks. Note that *Rumen* understands various time units like m(min), h(hour), d(days) etc. |
| `-output-duration` | This parameter defines the final runtime of the trace. Default value is 1 hour. | `-output-duration 30m` implies that the resulting trace will have a max runtime of 30mins. All the jobs in the input trace file will be folded and scaled to fit this window. |
| `-concentration` | Set the concentration of the resulting trace. Default value is 1. | If the total runtime of the resulting trace is less than the total runtime of the input trace, then the resulting trace would contain fewer jobs than the input trace. This essentially means that the output is diluted. To increase the density of jobs, set the concentration to a higher value. |
| `-debug` | Run the Folder in debug mode. By default it is set to false. | In debug mode, the Folder will print additional statements for debugging. Also the intermediate files generated in the scratch directory will not be cleaned up. |
| `-seed` | Initial seed to the Random Number Generator. By default, a Random Number Generator is used to generate a seed and the seed value is reported back to the user for future use. | If an initial seed is passed, then the Random Number Generator will generate the random numbers in the same sequence i.e. the sequence of random numbers remains the same if the same seed is used. Folder uses the Random Number Generator to decide whether or not to emit a job. |
| `-temp-directory` | Temporary directory for the Folder. By default the output folder's parent directory is used as the scratch space. | This is the scratch space used by Folder. All the temporary files are cleaned up in the end unless the Folder is run in debug mode. |
| `-skew-buffer-length` | Enables Folder to tolerate skewed jobs. The default buffer length is 0. | `-skew-buffer-length 100` indicates that if the jobs appear out of order within a window size of 100, then they will be emitted in-order by the folder. If a job appears out-of-order outside this window, then the Folder will bail out provided `-allow-missorting` is not set. Folder reports the maximum skew size seen in the input trace for future use. |
| `-allow-missorting` | Enables Folder to tolerate out-of-order jobs. By default mis-sorting is not allowed. | If mis-sorting is allowed, then the Folder will ignore out-of-order jobs that cannot be deskewed using a skew buffer of the size specified using `-skew-buffer-length`. If mis-sorting is not allowed, then the Folder will bail out if the skew buffer is incapable of tolerating the skew. |
$H4 Examples

$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime

    java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json

If the folded jobs are out of order then the command will bail out.

$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime and tolerate some skewness

    java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m -allow-missorting -skew-buffer-length 100 file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json

If the folded jobs are out of order, then at most
100 jobs will be de-skewed. If the 101st job is
*out-of-order*, then the command will bail out.

$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime in debug mode

    java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m -debug -temp-directory file:///tmp/debug file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json

This will fold the 10hr job-trace file
`file:///home/user/job-trace.json` to finish within 1hr
and use `file:///tmp/debug` as the temporary directory.
The intermediate files in the temporary directory will not be cleaned
up.

$H5 Folding an input trace with 10 hours of total runtime to generate an output trace with 1 hour of total runtime with custom concentration

    java org.apache.hadoop.tools.rumen.Folder -output-duration 1h -input-cycle 20m -concentration 2 file:///home/user/job-trace.json file:///home/user/job-trace-1hr.json

This will fold the 10hr job-trace file
`file:///home/user/job-trace.json` to finish within 1hr
with a concentration of 2. Folding 10 hours down to 1 hour with the
default concentration of 1 retains only 10% of the jobs; with
*concentration* set to 2, 20% of the total input jobs will be retained.


Appendix
--------

$H3 Resources

MAPREDUCE-751
is the main JIRA that introduced *Rumen* to *MapReduce*.
Look at the MapReduce
rumen component for further details.


$H3 Dependencies

*Rumen* expects certain library *JARs* to be present in
the *CLASSPATH*. The required libraries are:

* `Hadoop MapReduce Tools` (`hadoop-mapred-tools-{hadoop-version}.jar`)
* `Hadoop Common` (`hadoop-common-{hadoop-version}.jar`)
* `Apache Commons Logging` (`commons-logging-1.1.1.jar`)
* `Apache Commons CLI` (`commons-cli-1.2.jar`)
* `Jackson Mapper` (`jackson-mapper-asl-1.4.2.jar`)
* `Jackson Core` (`jackson-core-asl-1.4.2.jar`)

> One simple way to run Rumen is to use the `$HADOOP_HOME/bin/hadoop jar`
> command to run it.
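A hypothetical `hadoop jar` invocation of the TraceBuilder might look like the
following sketch; the JAR location and `{hadoop-version}` placeholder are
assumptions about a typical installation layout and may differ on your system:

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-rumen-{hadoop-version}.jar \
        org.apache.hadoop.tools.rumen.TraceBuilder \
        file:///home/user/job-trace.json \
        file:///home/user/topology.output \
        file:///home/user/logs/history/done

Because the `hadoop` wrapper script sets up the Hadoop classpath, the library
JARs listed above do not need to be added by hand in this case.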