Anonymized traces enable sharing of production traces of
+ large-scale Hadoop deployments. Sharing of traces will foster
+ collaboration within the Hadoop community. Anonymized traces can also
+ be used to supplement interesting research findings.
+
@@ -102,6 +107,11 @@
Increasing the trace runtime might involve adding some dummy jobs to
the resulting trace and scaling up the runtime of individual jobs.
+
Anonymizer :
+ A utility to anonymize Hadoop job and cluster topology traces by
+ masking certain sensitive fields but retaining important workload
+ characteristics.
+
@@ -128,10 +138,11 @@
output-duration, concentration etc.
- Rumen provides 2 basic commands
+ Rumen provides 3 basic commands
TraceBuilder
Folder
+
Anonymizer
Firstly, we need to generate the Gold Trace. Hence the first
@@ -139,8 +150,9 @@
The output of the TraceBuilder is a job-trace file (and an
optional cluster-topology file). In case we want to scale the output, we
can use the Folder utility to fold the current trace to the
- desired length. The remaining part of this section explains these
- utilities in detail.
+ desired length. To anonymize the trace, use the
+ Anonymizer utility. The remaining part of this section
+ explains these utilities in detail.
Examples in this section assume that certain libraries are present
@@ -426,8 +438,156 @@
+
+
+
+
+
+
+
+
+
+
+ Anonymizer
+
+
Command:
+
+
+
This command invokes the Anonymizer utility of
+ Rumen. It anonymizes sensitive information from the
+ <jobtrace-input> file and outputs the anonymized
+ content into the <jobtrace-output>
+ file. It also anonymizes the cluster layout (topology) from the
+ <topology-input> and outputs it in
+ the <topology-output> file.
+ <jobtrace-input> represents the job trace file obtained
+ using TraceBuilder or Folder.
+ <topology-input> represents the cluster topology
+ file obtained using TraceBuilder.
+
+
+
Options :
+
+
+
Parameter
+
Description
+
Notes
+
+
+
-trace
+
Anonymizes job traces.
+
Anonymizes sensitive fields such as user-name, job-name, queue-name,
+ host-names, and job configuration parameters.
+
+
+
-topology
+
Anonymizes the cluster topology.
+
Anonymizes rack-names and host-names.
+
+
+
+
+ Anonymizer Configuration Parameters
+
The Rumen anonymizer can be configured using the following
+ configuration parameters:
+
+
+
+
Parameter
+
Description
+
+
+
+ rumen.data-types.classname.preserve
+
+
A comma separated list of prefixes that the Anonymizer
+ will not anonymize while processing classnames. If
+ rumen.data-types.classname.preserve is set to
+ 'org.apache,com.hadoop.' then
+ classnames starting with 'org.apache' or
+ 'com.hadoop.' will not be anonymized.
+
+
+
+
+ rumen.datatypes.jobproperties.parsers
+
+
A comma separated list of job properties parsers. These parsers
+ decide how the job configuration parameters
+ (i.e. <key,value> pairs) should be processed. The default is
+ MapReduceJobPropertiesParser. The default parser will
+ only parse framework-level MapReduce specific job configuration
+ properties. Users can add custom parsers by implementing the
+ JobPropertiesParser interface. Rumen also provides an
+ all-pass (i.e. no filter) parser called
+ DefaultJobPropertiesParser.
+
+
+
+
+ rumen.anonymization.states.dir
+
+
Set this to a location (on LocalFileSystem or HDFS) for enabling
+ state persistence and/or reload. This parameter is not set by
+ default. Reloading and persistence of states depend on the state
+ directory. Note that the state directory will contain the latest
+ as well as previous states.
+
+
+
+
+ rumen.anonymization.states.persist
+
+
Set this to 'true' to persist the current state.
+ Default value is 'false'. Note that the states will
+ be persisted to the state manager's state directory
+ specified using the rumen.anonymization.states.dir
+ parameter.
+
+
+
+
+ rumen.anonymization.states.reload
+
+
Set this to 'true' to enable reuse of previously
+ persisted state. The default value is 'false'. The
+ previously persisted state will be reloaded from the state
+ manager's state directory specified using the
+ rumen.anonymization.states.dir parameter. Note that
+ the Anonymizer will bail out if it fails to find any
+ previously persisted state in the state directory or if the state
+ directory is not set. If the user wishes to retain/reuse the
+ states across multiple invocations of the Anonymizer,
+ then the very first invocation of the Anonymizer should
+ have rumen.anonymization.states.reload set to
+ 'false' and
+ rumen.anonymization.states.persist set to
+ 'true'. Subsequent invocations of the
+ Anonymizer can then have
+ rumen.anonymization.states.reload set to
+ 'true'.
+
+
+
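The prefix rule described for rumen.data-types.classname.preserve can be pictured with a small shell sketch. It is an illustration of the documented behaviour only, not Rumen's implementation: the helper name is_preserved and the variable PRESERVE_PREFIXES are hypothetical, and the sketch assumes prefixes are compared as plain strings.

```shell
# Illustrative sketch only; this is not Rumen's implementation.
# PRESERVE_PREFIXES mirrors the example value given for
# rumen.data-types.classname.preserve.
PRESERVE_PREFIXES="org.apache,com.hadoop."

# is_preserved (hypothetical helper): succeeds when the class name
# starts with any prefix in the comma separated list.
is_preserved() {
  class_name=$1
  old_ifs=$IFS
  IFS=,
  for prefix in $PRESERVE_PREFIXES; do
    case "$class_name" in
      "$prefix"*) IFS=$old_ifs; return 0 ;;
    esac
  done
  IFS=$old_ifs
  return 1
}

# Report what would happen to a few sample class names.
for name in org.apache.hadoop.mapred.TextInputFormat \
            com.hadoop.compression.lzo.LzoCodec \
            com.example.MyMapper; do
  if is_preserved "$name"; then
    echo "$name: kept as-is"
  else
    echo "$name: anonymized"
  fi
done
```

Under plain prefix matching, a prefix without a trailing dot such as 'org.apache' would also match a name like 'org.apachefoo.Bar', so ending each prefix with a dot may be the safer choice.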
+
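The persist-then-reload workflow described for the rumen.anonymization.states.* parameters might look like the following dry-run sketch, which assembles and prints the commands instead of executing them. It rests on several assumptions: that the utility's class is org.apache.hadoop.tools.rumen.Anonymizer, that the Rumen libraries are on the Java classpath, and that the utility accepts Hadoop-style "-D key=value" options. All paths are hypothetical.

```shell
# Dry-run sketch: commands are assembled and printed, not executed.
# Assumptions: the class name below, Rumen on the Java classpath, and
# support for Hadoop-style "-D key=value" options.
ANONYMIZER="java org.apache.hadoop.tools.rumen.Anonymizer"

# Hypothetical state directory shared by all invocations.
STATE_DIR="file:///home/user/anonymizer-states"
STATE_OPTS="-D rumen.anonymization.states.dir=$STATE_DIR"

# First invocation: there is no earlier state, so reload stays off
# and the freshly built state is persisted.
FIRST_FLAGS="-D rumen.anonymization.states.persist=true -D rumen.anonymization.states.reload=false"
FIRST_IO="-trace file:///home/user/trace-1.json file:///home/user/trace-1-anonymized.json"
FIRST_RUN="$ANONYMIZER $STATE_OPTS $FIRST_FLAGS $FIRST_IO"

# Later invocations: reload the persisted state. Keeping persist on is
# an assumption here, so that newly seen values are saved as well.
LATER_FLAGS="-D rumen.anonymization.states.persist=true -D rumen.anonymization.states.reload=true"
LATER_IO="-trace file:///home/user/trace-2.json file:///home/user/trace-2-anonymized.json"
LATER_RUN="$ANONYMIZER $STATE_OPTS $LATER_FLAGS $LATER_IO"

echo "$FIRST_RUN"
echo "$LATER_RUN"
```

Reusing the same state directory across runs is what lets a later trace be anonymized consistently with an earlier one; setting reload to 'true' on the very first run would make the Anonymizer bail out, since no persisted state exists yet.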
+
+
+ Example
+
+
+
This will anonymize the job details from
+ file:///home/user/job-trace.json and output it to
+ file:///home/user/job-trace-anonymized.json.
+ It will also anonymize the cluster topology layout from
+ file:///home/user/cluster-topology.json and output it to
+ file:///home/user/cluster-topology-anonymized.json.
+ Note that the Anonymizer also supports input and output
+ files on HDFS.
+
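An invocation matching the description above might be assembled as in the following dry-run sketch, which prints the command instead of running it. The class name org.apache.hadoop.tools.rumen.Anonymizer and the argument order (input file, then output file, after each option) are assumptions based on the option descriptions in this section; the Rumen libraries are assumed to be on the Java classpath.

```shell
# Dry-run sketch: builds and prints the command instead of running it.
# Assumes the class name below and Rumen on the Java classpath.
ANONYMIZER_CLASS="org.apache.hadoop.tools.rumen.Anonymizer"

# Input and output locations taken from the example description.
TRACE_IN="file:///home/user/job-trace.json"
TRACE_OUT="file:///home/user/job-trace-anonymized.json"
TOPOLOGY_IN="file:///home/user/cluster-topology.json"
TOPOLOGY_OUT="file:///home/user/cluster-topology-anonymized.json"

# -trace anonymizes the job trace, -topology the cluster layout.
TRACE_ARGS="-trace $TRACE_IN $TRACE_OUT"
TOPOLOGY_ARGS="-topology $TOPOLOGY_IN $TOPOLOGY_OUT"

ANONYMIZE_CMD="java $ANONYMIZER_CLASS $TRACE_ARGS $TOPOLOGY_ARGS"
echo "$ANONYMIZE_CMD"
```

For files on HDFS, the file:// locations would be replaced with the corresponding HDFS paths.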