From fc1bf705e9b14833385ff0fd31a5bc43ba8fd57d Mon Sep 17 00:00:00 2001
From: Christopher Douglas <cdouglas@apache.org>
Date: Fri, 17 Jul 2009 02:06:42 +0000
Subject: [PATCH] HADOOP-6142. Update documentation and use of harchives for
 relative paths added in MAPREDUCE-739. Contributed by Mahadev Konar

git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/trunk@794943 13f79535-47bb-0310-9956-ffa450edef68
---
 CHANGES.txt                              |  3 +
 bin/hadoop                               |  2 +-
 .../content/xdocs/hadoop_archives.xml    | 89 +++++++++++++------
 3 files changed, 67 insertions(+), 27 deletions(-)

diff --git a/CHANGES.txt b/CHANGES.txt
index 051b3101279..88ca9c8f0d2 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -472,6 +472,9 @@ Trunk (unreleased changes)
     HADOOP-6099. The RPC module can be configured to not send period pings. The
     default behaviour of sending periodic pings remain unchanged. (dhruba)
 
+    HADOOP-6142. Update documentation and use of harchives for relative paths
+    added in MAPREDUCE-739. (Mahadev Konar via cdouglas)
+
   OPTIMIZATIONS
 
     HADOOP-5595. NameNode does not need to run a replicator to choose a
diff --git a/bin/hadoop b/bin/hadoop
index 4c680212155..7618e1a0819 100755
--- a/bin/hadoop
+++ b/bin/hadoop
@@ -29,7 +29,7 @@ function print_usage(){
   echo "  version              print the version"
   echo "  jar <jar>            run a jar file"
   echo "  distcp <srcurl> <desturl> copy file or directories recursively"
-  echo "  archive -archiveName NAME <src>* <dest> create a hadoop archive"
+  echo "  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive"
   echo "  classpath            prints the class path needed to get the"
   echo "             Hadoop jar and the required libraries"
   echo "  daemonlog            get/set the log level for each daemon"
diff --git a/src/docs/src/documentation/content/xdocs/hadoop_archives.xml b/src/docs/src/documentation/content/xdocs/hadoop_archives.xml
index 44e7b8a1929..92b3053de05 100644
--- a/src/docs/src/documentation/content/xdocs/hadoop_archives.xml
+++ b/src/docs/src/documentation/content/xdocs/hadoop_archives.xml
@@ -32,26 +32,25 @@ within the part files.

 How to create an archive?

- Usage: hadoop archive -archiveName name <src>* <dest>
+ Usage: hadoop archive -archiveName name -p <parent> <src>* <dest>

 -archiveName is the name of the archive you would like to create. An example would be foo.har. The name should have a *.har extension.
- The inputs are file system pathnames which work as usual with regular
- expressions. The destination directory would contain the archive.
+ The -p <parent> argument specifies the parent path relative to which the files should be archived. For example:

+ -p /foo/bar a/b/c e/f/g

+ Here /foo/bar is the parent path, and a/b/c and e/f/g are paths relative to the parent.
 Note that this is a Map/Reduce job that creates the archives. You would
- need a map reduce cluster to run this. The following is an example:

-
- hadoop archive -archiveName foo.har /user/hadoop/dir1 /user/hadoop/dir2 /user/zoo/
-
- In the above example /user/hadoop/dir1 and /user/hadoop/dir2 will be
- archived in the following file system directory -- /user/zoo/foo.har.
- The sources are not changed or removed when an archive is created.

+ need a map reduce cluster to run this. For a detailed example, see the later sections.
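+ As an illustration, assuming a hypothetical destination directory /outputdir, the
+ full command for the -p usage above might look like:
+
+ hadoop archive -archiveName myarch.har -p /foo/bar a/b/c e/f/g /outputdir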

+

+ If you just want to archive a single directory /foo/bar then you can just use

+

+ hadoop archive -archiveName zoo.har -p /foo/bar /outputdir

+
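+ To verify the result you could then list the archive's contents; since zoo.har is
+ written to /outputdir, an illustrative check would be:
+
+ hadoop dfs -lsr har:///outputdir/zoo.har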
How to look up files in archives?

@@ -61,20 +60,58 @@ an error.
 URI for Hadoop Archives is

har://scheme-hostname:port/archivepath/fileinarchive

 If no scheme is provided it assumes the underlying filesystem.
- In that case the URI would look like

- har:///archivepath/fileinarchive

-

- Here is an example of archive. The input to the archives is /dir. The directory dir contains
- files filea, fileb. To archive /dir to /user/hadoop/foo.har, the command is

-

- hadoop archive -archiveName foo.har /dir /user/hadoop

- To get file listing for files in the created archive

-

- hadoop dfs -lsr har:///user/hadoop/foo.har

-

- To cat filea in archive

- hadoop dfs -cat har:///user/hadoop/foo.har/dir/filea

+ In that case the URI would look like

+

+ har:///archivepath/fileinarchive
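+ For example, a file dir1/file1 inside the archive /user/zoo/foo.har on the default
+ filesystem could be addressed as follows (file1 is a hypothetical file name):
+
+ har:///user/zoo/foo.har/dir1/file1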

+ Example on creating and looking up archives

+ hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo

+

+ The above example creates an archive using /user/hadoop as the relative archive directory.
+ The directories /user/hadoop/dir1 and /user/hadoop/dir2 will be
+ archived in the following file system directory -- /user/zoo/foo.har. Archiving does not delete the input
+ files. If you want to delete the input files after creating the archives (to reduce namespace), you
+ will have to do it on your own.
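+ For instance, once the archive has been verified, the inputs could be removed with
+ something like:
+
+ hadoop dfs -rmr /user/hadoop/dir1 /user/hadoop/dir2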

+ +
+ Looking up files and understanding the -p option

+ Looking up files in hadoop archives is as easy as doing an ls on the filesystem. After you have
+ archived the directories /user/hadoop/dir1 and /user/hadoop/dir2 as in the example above, to see all
+ the files in the archives you can just run:

+

+ hadoop dfs -lsr har:///user/zoo/foo.har/

+
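+ Supposing dir1 and dir2 contain one file each (file1 and file2 are hypothetical
+ names used only to show the shape of the listing), the output would resemble:
+
+ har:///user/zoo/foo.har/dir1
+ har:///user/zoo/foo.har/dir1/file1
+ har:///user/zoo/foo.har/dir2
+ har:///user/zoo/foo.har/dir2/file2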

+ To understand the significance of the -p argument, let's go through the above example again. If you just do
+ an ls (not lsr) on the hadoop archive using

+

+ hadoop dfs -ls har:///user/zoo/foo.har

+

+ The output should be:

+ har:///user/zoo/foo.har/dir1
+ har:///user/zoo/foo.har/dir2

+ As you can recall, the archives were created with the following command:

+

+ hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo

+

+ If we were to change the command to:

+

+ hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo

+

+ then an ls on the hadoop archive using

+

+ hadoop dfs -ls har:///user/zoo/foo.har

+

+ would give you:

+ har:///user/zoo/foo.har/hadoop/dir1
+ har:///user/zoo/foo.har/hadoop/dir2

+ Notice that the files have now been archived relative to /user/ rather than /user/hadoop.

+
+
+ +
+ Using Hadoop Archives with Map Reduce

+ Using Hadoop Archives in Map Reduce is as easy as specifying a different input filesystem than the default file system.
+ If you have a hadoop archive stored in HDFS in /user/zoo/foo.har, then all you need to do to use this archive as
+ Map Reduce input is specify the input directory as har:///user/zoo/foo.har. Since a Hadoop Archive is exposed as a
+ file system, Map Reduce will be able to use all the logical input files in the archive as input.
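+ For example, assuming the stock examples jar that ships with the release (its exact
+ name varies by version), the wordcount job could read its input directly out of the
+ archive:
+
+ hadoop jar hadoop-*-examples.jar wordcount har:///user/zoo/foo.har/dir1 /user/zoo/wc-out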

+
+