From 73e0b8e55ab32ee5858a04038e5a826873bd8df7 Mon Sep 17 00:00:00 2001
From: Jonathan Turner Eagles
Date: Tue, 29 Apr 2014 21:27:14 +0000
Subject: [PATCH] MAPREDUCE-5638. Port Hadoop Archives document to trunk (Akira AJISAKA via jeagles)

git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/branches/branch-2@1591108 13f79535-47bb-0310-9956-ffa450edef68
---
 hadoop-mapreduce-project/CHANGES.txt               |   3 +
 .../src/site/markdown/HadoopArchives.md.vm         | 138 ++++++++++++++++++
 hadoop-project/src/site/site.xml                   |   1 +
 3 files changed, 142 insertions(+)
 create mode 100644 hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm

diff --git a/hadoop-mapreduce-project/CHANGES.txt b/hadoop-mapreduce-project/CHANGES.txt
index a734784a266..0ecc4f5f67c 100644
--- a/hadoop-mapreduce-project/CHANGES.txt
+++ b/hadoop-mapreduce-project/CHANGES.txt
@@ -36,6 +36,9 @@ Release 2.5.0 - UNRELEASED
     MAPREDUCE-5812. Make job context available to
     OutputCommitter.isRecoverySupported() (Mohammad Kamrul Islam via jlowe)
 
+    MAPREDUCE-5638. Port Hadoop Archives document to trunk (Akira AJISAKA via
+    jeagles)
+
   OPTIMIZATIONS
 
   BUG FIXES
diff --git a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm
new file mode 100644
index 00000000000..431310a9f5c
--- /dev/null
+++ b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm
@@ -0,0 +1,138 @@
+
+
+#set ( $H3 = '###' )
+
+Hadoop Archives Guide
+=====================
+
+  - [Overview](#Overview)
+  - [How to Create an Archive](#How_to_Create_an_Archive)
+  - [How to Look Up Files in Archives](#How_to_Look_Up_Files_in_Archives)
+  - [Archives Examples](#Archives_Examples)
+      - [Creating an Archive](#Creating_an_Archive)
+      - [Looking Up Files](#Looking_Up_Files)
+  - [Hadoop Archives and MapReduce](#Hadoop_Archives_and_MapReduce)
+
+Overview
+--------
+
+  Hadoop archives are special format archives. A Hadoop archive maps to a file
+  system directory. A Hadoop archive always has a \*.har extension. A Hadoop
+  archive directory contains metadata (in the form of _index and _masterindex)
+  and data (part-\*) files. The _index file contains the names of the files
+  that are part of the archive and their locations within the part files.
+
+How to Create an Archive
+------------------------
+
+  `Usage: hadoop archive -archiveName name -p <parent> <src>* <dest>`
+
+  -archiveName is the name of the archive you would like to create. An example
+  would be foo.har. The name should have a \*.har extension. The parent
+  argument specifies the path relative to which the files are archived.
+  An example would be:
+
+  `-p /foo/bar a/b/c e/f/g`
+
+  Here /foo/bar is the parent path, and a/b/c and e/f/g are paths relative to
+  the parent. Note that archives are created by a MapReduce job, so you
+  would need a MapReduce cluster to run this.
+  For a detailed example, see the later sections.
+
+  If you just want to archive a single directory /foo/bar, you can use
+
+  `hadoop archive -archiveName zoo.har -p /foo/bar /outputdir`
+
+How to Look Up Files in Archives
+--------------------------------
+
+  The archive exposes itself as a file system layer, so all the fs shell
+  commands work on archives, but with a different URI. Also note that
+  archives are immutable, so renames, deletes and creates return an error.
+  The URI for Hadoop Archives is
+
+  `har://scheme-hostname:port/archivepath/fileinarchive`
+
+  If no scheme is provided, the underlying filesystem is assumed. In that case
+  the URI would look like
+
+  `har:///archivepath/fileinarchive`
+
+Archives Examples
+-----------------
+
+$H3 Creating an Archive
+
+  `hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`
+
+  The above example creates an archive using /user/hadoop as the relative
+  archive directory. The directories /user/hadoop/dir1 and /user/hadoop/dir2
+  will be archived in the file system directory /user/zoo/foo.har.
+  Archiving does not delete the input files. If you want to delete the input
+  files after creating the archives (to reduce namespace usage), you will have
+  to do it on your own.
+
+$H3 Looking Up Files
+
+  Looking up files in hadoop archives is as easy as doing an ls on the
+  filesystem. After you have archived the directories /user/hadoop/dir1 and
+  /user/hadoop/dir2 as in the example above, to see all the files in the
+  archives you can just run:
+
+  `hdfs dfs -ls -R har:///user/zoo/foo.har/`
+
+  To understand the significance of the -p argument, let's go through the above
+  example again.
+  If you just do an ls (not ls -R) on the hadoop archive using
+
+  `hdfs dfs -ls har:///user/zoo/foo.har`
+
+  the output should be:
+
+```
+har:///user/zoo/foo.har/dir1
+har:///user/zoo/foo.har/dir2
+```
+
+  As you may recall, the archive was created with the following command:
+
+  `hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`
+
+  If we were to change the command to:
+
+  `hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo`
+
+  then an ls on the hadoop archive using
+
+  `hdfs dfs -ls har:///user/zoo/foo.har`
+
+  would give you
+
+```
+har:///user/zoo/foo.har/hadoop/dir1
+har:///user/zoo/foo.har/hadoop/dir2
+```
+
+  Notice that the files have been archived relative to /user/ rather
+  than /user/hadoop.
+
+Hadoop Archives and MapReduce
+-----------------------------
+
+  Using Hadoop Archives in MapReduce is as easy as specifying a different input
+  filesystem than the default file system. If you have a hadoop archive stored
+  in HDFS in /user/zoo/foo.har, then to use this archive as MapReduce input,
+  all you need to do is specify the input directory as har:///user/zoo/foo.har.
+  Since Hadoop Archives is exposed as a file system, MapReduce will be able to
+  use all the logical input files in Hadoop Archives as input.
diff --git a/hadoop-project/src/site/site.xml b/hadoop-project/src/site/site.xml
index e3ffb24d44e..9a39d533798 100644
--- a/hadoop-project/src/site/site.xml
+++ b/hadoop-project/src/site/site.xml
@@ -92,6 +92,7 @@
+