diff --git a/hadoop-mapreduce-project/CHANGES.txt b/hadoop-mapreduce-project/CHANGES.txt
index 764975f0554..b339d542c40 100644
--- a/hadoop-mapreduce-project/CHANGES.txt
+++ b/hadoop-mapreduce-project/CHANGES.txt
@@ -4,7 +4,8 @@ Release 0.23.1 - Unreleased
INCOMPATIBLE CHANGES
- NEW FEATURES
+ NEW FEATURES
+ MAPREDUCE-778. Rumen Anonymizer. (Amar Kamat and Chris Douglas via amarrk)
MAPREDUCE-3121. NodeManager should handle disk-failures (Ravi Gummadi via mahadev)
@@ -14,6 +15,8 @@ Release 0.23.1 - Unreleased
MAPREDUCE-3251. Network ACLs can prevent some clients to talk to MR ApplicationMaster.
(Anupam Seth via mahadev)
+ MAPREDUCE-778. Rumen Anonymizer. (Amar Kamat and Chris Douglas via amarrk)
+
IMPROVEMENTS
MAPREDUCE-3375. [Gridmix] Memory Emulation system tests.
(Vinay Thota via amarrk)
diff --git a/hadoop-mapreduce-project/ivy.xml b/hadoop-mapreduce-project/ivy.xml
index e9b38d077eb..e04da7019bb 100644
--- a/hadoop-mapreduce-project/ivy.xml
+++ b/hadoop-mapreduce-project/ivy.xml
@@ -139,6 +139,13 @@
- Rumen provides 2 basic commands
+ Rumen provides 3 basic commands
Firstly, we need to generate the Gold Trace. Hence the first
@@ -139,8 +150,9 @@
The output of the output-duration
, concentration
etc.
-
TraceBuilder
Folder
Anonymizer
TraceBuilder is a job-trace file (and an
optional cluster-topology file). In case we want to scale the output, we
can use the Folder utility to fold the current trace to the
- desired length. The remaining part of this section explains these
- utilities in detail.
+ desired length. For anonymizing the trace, use the
+ Anonymizer utility. The remaining part of this section
+ explains these utilities in detail.
Command:
This command invokes the Anonymizer utility of
+ Rumen. It anonymizes sensitive information from the
+ <jobtrace-input> file and outputs the anonymized
+ content into the <jobtrace-output>
+ file. It also anonymizes the cluster layout (topology) from the
+ <topology-input> file and outputs it in
+ the <topology-output> file.
+ <jobtrace-input> represents the job trace file obtained
+ using TraceBuilder or Folder.
+ <topology-input> represents the cluster topology
+ file obtained using TraceBuilder.
+
Options :
+ Parameter | Description | Notes |
+ ---|---|---|
+ -trace | Anonymizes job traces. | Anonymizes sensitive fields like user-name, job-name, queue-name, host-names, job configuration parameters etc. |
+ -topology | Anonymizes cluster topology. | Anonymizes rack-names and host-names. |
The Rumen anonymizer can be configured using the following
+ configuration parameters:
+ Parameter | Description |
+ ---|---|
+ rumen.data-types.classname.preserve | A comma-separated list of prefixes that the Anonymizer will not anonymize while processing classnames. If rumen.data-types.classname.preserve is set to 'org.apache,com.hadoop.' then classnames starting with 'org.apache' or 'com.hadoop.' will not be anonymized. |
+ rumen.datatypes.jobproperties.parsers | A comma-separated list of job properties parsers. These parsers decide how the job configuration parameters (i.e. <key,value> pairs) should be processed. Default is MapReduceJobPropertiesParser. The default parser will only parse framework-level MapReduce specific job configuration properties. Users can add custom parsers by implementing the JobPropertiesParser interface. Rumen also provides an all-pass (i.e. no filter) parser called DefaultJobPropertiesParser. |
+ rumen.anonymization.states.dir | Set this to a location (on LocalFileSystem or HDFS) for enabling state persistence and/or reload. This parameter is not set by default. Reloading and persistence of states depend on the state directory. Note that the state directory will contain the latest as well as previous states. |
+ rumen.anonymization.states.persist | Set this to 'true' to persist the current state. Default value is 'false'. Note that the states will be persisted to the state manager's state directory specified using the rumen.anonymization.states.dir parameter. |
+ rumen.anonymization.states.reload | Set this to 'true' to enable reuse of previously persisted state. The default value is 'false'. The previously persisted state will be reloaded from the state manager's state directory specified using the rumen.anonymization.states.dir parameter. Note that the Anonymizer will bail out if it fails to find any previously persisted state in the state directory or if the state directory is not set. If the user wishes to retain/reuse the states across multiple invocations of the Anonymizer, then the very first invocation of the Anonymizer should have rumen.anonymization.states.reload set to 'false' and rumen.anonymization.states.persist set to 'true'. Subsequent invocations of the Anonymizer can then have rumen.anonymization.states.reload set to 'true'. |
+
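The prefix rule described for rumen.data-types.classname.preserve can be illustrated with a minimal sketch. It assumes the match is a plain "starts with" test on the classname, which is how the description reads. The class `ClassnamePreserveSketch` and its methods are hypothetical helpers written for this illustration and are not part of Rumen.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Rumen code: it only models the documented prefix rule.
final class ClassnamePreserveSketch {

    /** Splits a comma-separated value such as "org.apache,com.hadoop." into prefixes. */
    static List<String> parsePrefixes(String value) {
        List<String> prefixes = new ArrayList<>();
        if (value == null) {
            return prefixes;
        }
        for (String part : value.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                prefixes.add(trimmed);
            }
        }
        return prefixes;
    }

    /** Returns true if the classname starts with any preserved prefix (assumed semantics). */
    static boolean isPreserved(String className, List<String> prefixes) {
        for (String prefix : prefixes) {
            if (className.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> prefixes = parsePrefixes("org.apache,com.hadoop.");
        // Classnames below are illustrative inputs only.
        System.out.println(isPreserved("org.apache.hadoop.mapred.TextInputFormat", prefixes));
        System.out.println(isPreserved("com.hadoop.compression.lzo.LzoCodec", prefixes));
        System.out.println(isPreserved("com.example.MyMapper", prefixes));
    }
}
```

Under this assumed reading, a prefix without a trailing dot such as 'org.apache' would also match a classname like 'org.apachefoo.Bar', so ending each prefix with a dot is the safer choice.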
This will anonymize the job details from
+ file:///home/user/job-trace.json and output it to
+ file:///home/user/job-trace-anonymized.json.
+ It will also anonymize the cluster topology layout from
+ file:///home/user/cluster-topology.json and output it to
+ file:///home/user/cluster-topology-anonymized.json.
+ Note that the Anonymizer also supports input and output
+ files on HDFS.
+
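When the Anonymizer is run several times over related traces, the rumen.anonymization.states.* parameters decide whether earlier state is reused. The sketch below models only the documented rules: a reload bails out when the state directory is unset or holds no persisted state, and the first run persists without reloading. The class `StateSettingsSketch`, its methods, and the directory `file:///tmp/rumen-states` are hypothetical and used for illustration. This is not Rumen's state manager.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the documented rules; not Rumen's actual state manager.
final class StateSettingsSketch {
    static final String DIR = "rumen.anonymization.states.dir";
    static final String PERSIST = "rumen.anonymization.states.persist";
    static final String RELOAD = "rumen.anonymization.states.reload";

    /** Builds the three settings for one invocation; a null dir leaves the directory unset. */
    static Map<String, String> settings(String dir, boolean persist, boolean reload) {
        Map<String, String> conf = new HashMap<>();
        if (dir != null) {
            conf.put(DIR, dir);
        }
        conf.put(PERSIST, Boolean.toString(persist));
        conf.put(RELOAD, Boolean.toString(reload));
        return conf;
    }

    /** Returns null when the documented reload rules hold, otherwise the reason to bail out. */
    static String reloadProblem(Map<String, String> conf, boolean persistedStateExists) {
        boolean reload = Boolean.parseBoolean(conf.getOrDefault(RELOAD, "false"));
        if (!reload) {
            return null; // nothing to reload, so no precondition applies
        }
        String dir = conf.get(DIR);
        if (dir == null || dir.isEmpty()) {
            return "state directory is not set";
        }
        if (!persistedStateExists) {
            return "no previously persisted state found in " + dir;
        }
        return null;
    }

    public static void main(String[] args) {
        String dir = "file:///tmp/rumen-states"; // hypothetical location

        // First invocation: persist the state, do not reload.
        Map<String, String> first = settings(dir, true, false);
        System.out.println("first run problem: " + reloadProblem(first, false));

        // Later invocation: reload the state. Keeping persist on is an assumption.
        Map<String, String> later = settings(dir, true, true);
        System.out.println("later run problem: " + reloadProblem(later, true));

        // Reload requested without a state directory: the documented bail-out case.
        Map<String, String> broken = settings(null, false, true);
        System.out.println("misconfigured run problem: " + reloadProblem(broken, false));
    }
}
```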