MAPREDUCE-5638. Port Hadoop Archives document to trunk (Akira AJISAKA via jeagles)

git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/branches/branch-2@1591108 13f79535-47bb-0310-9956-ffa450edef68
2014-04-29 21:27:14 +00:00 · 2014-04-29 21:27:14 +00:00 · 73e0b8e55a
parent fa889d0ef6
commit 73e0b8e55a
3 changed files with 142 additions and 0 deletions
--- a/hadoop-mapreduce-project/CHANGES.txt
+++ b/hadoop-mapreduce-project/CHANGES.txt
@@ -36,6 +36,9 @@ Release 2.5.0 - UNRELEASED
    MAPREDUCE-5812. Make job context available to
    OutputCommitter.isRecoverySupported() (Mohammad Kamrul Islam via jlowe)
    MAPREDUCE-5638. Port Hadoop Archives document to trunk (Akira AJISAKA via
    jeagles)
  OPTIMIZATIONS
  BUG FIXES 
--- a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm
+++ b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm
@@ -0,0 +1,138 @@
 <!---
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
 -->
 #set ( $H3 = '###' )
 Hadoop Archives Guide
 =====================
 - [Overview](#Overview)
 - [How to Create an Archive](#How_to_Create_an_Archive)
 - [How to Look Up Files in Archives](#How_to_Look_Up_Files_in_Archives)
 - [Archives Examples](#Archives_Examples)
     - [Creating an Archive](#Creating_an_Archive)
     - [Looking Up Files](#Looking_Up_Files)
 - [Hadoop Archives and MapReduce](#Hadoop_Archives_and_MapReduce)
 Overview
 --------
  A Hadoop archive is a special archive format that maps to a file system
  directory. A Hadoop archive always has a \*.har extension. The archive
  directory contains metadata files (_index and _masterindex) and data files
  (part-\*). The _index file contains the names of the files that are part of
  the archive and their locations within the part files.
 How to Create an Archive
 ------------------------
  `Usage: hadoop archive -archiveName name -p <parent> <src>* <dest>`
  -archiveName is the name of the archive you would like to create, for
  example foo.har. The name should have a \*.har extension. The -p (parent)
  argument specifies the path relative to which the files are archived. For
  example:
  `-p /foo/bar a/b/c e/f/g`
  Here /foo/bar is the parent path, and a/b/c and e/f/g are paths relative to
  the parent. Note that archives are created by a Map/Reduce job, so you need
  a Map/Reduce cluster to run this command. For a detailed example, see the
  later sections.
  If you just want to archive a single directory /foo/bar, you can use:
  `hadoop archive -archiveName zoo.har -p /foo/bar /outputdir`
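  The usage rules above can be sketched as a small shell helper. This is an
  illustration only: `archive_cmd` is a hypothetical function, not part of
  Hadoop. It prints the command line instead of running it, and it rejects
  archive names without the .har extension.

```shell
#!/usr/bin/env bash
# archive_cmd is a hypothetical helper, not part of Hadoop. It prints the
# "hadoop archive" command line for review and does not run it.
# Usage: archive_cmd <name.har> <parent> [<src>...] <dest>
archive_cmd() {
  local name="$1" parent="$2"
  shift 2
  # The archive name should carry the .har extension.
  case "$name" in
    *.har) ;;
    *) echo "archive name must end in .har: $name" >&2; return 1 ;;
  esac
  # Remaining arguments: zero or more sources relative to the parent,
  # followed by the destination directory.
  printf 'hadoop archive -archiveName %s -p %s %s\n' "$name" "$parent" "$*"
}

archive_cmd foo.har /foo/bar a/b/c e/f/g /outputdir
archive_cmd zoo.har /foo/bar /outputdir
```

  The second call has no source paths, which matches the single-directory form
  shown above.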
 How to Look Up Files in Archives
 --------------------------------
  The archive exposes itself as a file system layer, so all the fs shell
  commands work on archives, but with a different URI. Also note that
  archives are immutable, so renames, deletes and creates return an error.
  The URI for Hadoop Archives is
  `har://scheme-hostname:port/archivepath/fileinarchive`
  If no scheme is provided, the underlying filesystem is assumed. In that case
  the URI would look like
  `har:///archivepath/fileinarchive`
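  As a rough sketch of the two URI forms, the hypothetical helper `har_uri`
  below assembles a har URI from its parts. The host name and port in the
  second call are placeholders, not values taken from this guide.

```shell
#!/usr/bin/env bash
# har_uri is a hypothetical helper, not part of Hadoop.
# Usage: har_uri <archivepath> <fileinarchive> [<scheme> <hostname> <port>]
har_uri() {
  local archive="${1%/}" file="${2#/}"
  if [ "$#" -ge 5 ]; then
    # Explicit form: har://scheme-hostname:port/archivepath/fileinarchive
    printf 'har://%s-%s:%s%s/%s\n' "$3" "$4" "$5" "$archive" "$file"
  else
    # No scheme given: the underlying filesystem is assumed.
    printf 'har://%s/%s\n' "$archive" "$file"
  fi
}

har_uri /user/zoo/foo.har dir1/file.txt
# namenode.example.com and 8020 are placeholder values.
har_uri /user/zoo/foo.har dir1/file.txt hdfs namenode.example.com 8020
```

  The first call prints `har:///user/zoo/foo.har/dir1/file.txt`; the second
  prints `har://hdfs-namenode.example.com:8020/user/zoo/foo.har/dir1/file.txt`.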
 Archives Examples
 -----------------
 $H3 Creating an Archive
  `hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`
  The above example creates an archive using /user/hadoop as the parent
  directory, relative to which the sources are archived. The directories
  /user/hadoop/dir1 and /user/hadoop/dir2 will be archived in the file system
  directory /user/zoo/foo.har. Archiving does not delete the input files. If
  you want to delete the input files after creating the archive (to reduce
  namespace usage), you have to do it yourself.
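  One possible way to handle the manual cleanup is sketched below. The helper
  `cleanup_cmds` is hypothetical. It only prints the commands, so they can be
  reviewed before anything is deleted. Checking for the _index file is a cheap
  sanity check under the assumption that a finished archive contains it; it is
  not proof that the archive is complete.

```shell
#!/usr/bin/env bash
# cleanup_cmds is a hypothetical helper, not part of Hadoop. It only prints
# commands for review; it never deletes anything itself.
# Usage: cleanup_cmds <parent> <archive-dir> <src>...
cleanup_cmds() {
  local parent="${1%/}" archive="${2%/}" src
  shift 2
  # Sanity check: the archive's _index metadata file should exist.
  printf 'hdfs dfs -test -e %s/_index\n' "$archive"
  # One removal per source directory, resolved against the parent path.
  for src in "$@"; do
    printf 'hdfs dfs -rm -r %s/%s\n' "$parent" "$src"
  done
}

cleanup_cmds /user/hadoop /user/zoo/foo.har dir1 dir2
```

  Before running the printed removals, it is worth listing the archive
  contents and comparing them against the source directories.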
 $H3 Looking Up Files
  Looking up files in hadoop archives is as easy as doing an ls on the
  filesystem. After you have archived the directories /user/hadoop/dir1 and
  /user/hadoop/dir2 as in the example above, to see all the files in the
  archives you can just run:
  `hdfs dfs -ls -R har:///user/zoo/foo.har/`
  To understand the significance of the -p argument, let's go through the above
  example again. If you just do an ls (not an ls -R) on the Hadoop archive using
  `hdfs dfs -ls har:///user/zoo/foo.har`
  The output should be:
 ```
 har:///user/zoo/foo.har/dir1
 har:///user/zoo/foo.har/dir2
 ```
  As you may recall, the archive was created with the following command:
  `hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`
  If we were to change the command to:
  `hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo`
  then an ls on the Hadoop archive using
  `hdfs dfs -ls har:///user/zoo/foo.har`
  would give you
 ```
 har:///user/zoo/foo.har/hadoop/dir1
 har:///user/zoo/foo.har/hadoop/dir2
 ```
  Notice that the archived files have been archived relative to /user/ rather
  than /user/hadoop.
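  The effect of -p can be summarised with a small sketch. The hypothetical
  helper `map_archive_paths` prints, for each source, the original path and
  the path it receives inside the archive. It only mirrors the two listings
  shown above and does not contact a cluster.

```shell
#!/usr/bin/env bash
# map_archive_paths is a hypothetical helper, not part of Hadoop.
# Usage: map_archive_paths <parent> <dest> <name.har> <src>...
map_archive_paths() {
  local parent="${1%/}" dest="${2%/}" name="$3" src
  shift 3
  for src in "$@"; do
    # Source path = parent/src; archive entry keeps only the src part.
    printf '%s/%s -> har://%s/%s/%s\n' "$parent" "$src" "$dest" "$name" "$src"
  done
}

# -p /user/hadoop dir1 dir2 /user/zoo
map_archive_paths /user/hadoop /user/zoo foo.har dir1 dir2
# -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo
map_archive_paths /user/ /user/zoo foo.har hadoop/dir1 hadoop/dir2
```

  Both calls refer to the same source directories, but the second one places
  them under hadoop/ inside the archive because the parent path is shorter.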
 Hadoop Archives and MapReduce
 -----------------------------
  Using Hadoop Archives in MapReduce is as easy as specifying a different input
  filesystem than the default file system. If you have a Hadoop archive stored
  in HDFS at /user/zoo/foo.har, then to use this archive as MapReduce input,
  all you need to do is specify the input directory as
  har:///user/zoo/foo.har. Since a Hadoop archive is exposed as a file system,
  MapReduce can use all the logical input files in the archive as input.
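  As an illustration, a job submission that reads from the archive might look
  like the sketch below. It assumes the wordcount program from the MapReduce
  examples jar; the jar file name and the output directory are placeholders,
  and `wordcount_cmd` is a hypothetical helper that prints the command rather
  than running it.

```shell
#!/usr/bin/env bash
# Assumption: EXAMPLES_JAR is the MapReduce examples jar from your Hadoop
# distribution; the default file name below is a placeholder.
EXAMPLES_JAR="${EXAMPLES_JAR:-hadoop-mapreduce-examples.jar}"

# wordcount_cmd is a hypothetical helper that prints (does not run) the job
# submission command. Usage: wordcount_cmd <input-uri> <output-dir>
wordcount_cmd() {
  printf 'hadoop jar %s wordcount %s %s\n' "$EXAMPLES_JAR" "$1" "$2"
}

# Input is a directory inside the archive, addressed with a har URI.
wordcount_cmd har:///user/zoo/foo.har/dir1 /user/zoo/wordcount-out
```

  The only difference from a job that reads plain HDFS input is the har URI
  passed as the input path.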
--- a/hadoop-project/src/site/site.xml
+++ b/hadoop-project/src/site/site.xml
@@ -92,6 +92,7 @@
      <item name="Encrypted Shuffle" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html"/>
      <item name="Pluggable Shuffle/Sort" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html"/>
      <item name="Distributed Cache Deploy" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html"/>
      <item name="Hadoop Archives" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html"/>
      <item name="DistCp" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistCp.html"/>
    </menu>