MAPREDUCE-5638. Port Hadoop Archives document to trunk (Akira AJISAKA via jeagles)
git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/trunk@1591107 13f79535-47bb-0310-9956-ffa450edef68
This commit is contained in:
parent
19176f423a
commit
693025a3d4
|
@ -175,6 +175,9 @@ Release 2.5.0 - UNRELEASED
|
|||
MAPREDUCE-5812. Make job context available to
|
||||
OutputCommitter.isRecoverySupported() (Mohammad Kamrul Islam via jlowe)
|
||||
|
||||
MAPREDUCE-5638. Port Hadoop Archives document to trunk (Akira AJISAKA via
|
||||
jeagles)
|
||||
|
||||
OPTIMIZATIONS
|
||||
|
||||
BUG FIXES
|
||||
|
|
|
@ -0,0 +1,138 @@
|
|||
<!---
|
||||
Licensed under the Apache License, Version 2.0 (the "License");
|
||||
you may not use this file except in compliance with the License.
|
||||
You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software
|
||||
distributed under the License is distributed on an "AS IS" BASIS,
|
||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
See the License for the specific language governing permissions and
|
||||
limitations under the License. See accompanying LICENSE file.
|
||||
-->
|
||||
|
||||
#set ( $H3 = '###' )
|
||||
|
||||
Hadoop Archives Guide
|
||||
=====================
|
||||
|
||||
- [Overview](#Overview)
|
||||
- [How to Create an Archive](#How_to_Create_an_Archive)
|
||||
- [How to Look Up Files in Archives](#How_to_Look_Up_Files_in_Archives)
|
||||
- [Archives Examples](#Archives_Examples)
|
||||
- [Creating an Archive](#Creating_an_Archive)
|
||||
- [Looking Up Files](#Looking_Up_Files)
|
||||
- [Hadoop Archives and MapReduce](#Hadoop_Archives_and_MapReduce)
|
||||
|
||||
Overview
|
||||
--------
|
||||
|
||||
Hadoop archives are special format archives. A Hadoop archive maps to a file
|
||||
system directory. A Hadoop archive always has a \*.har extension. A Hadoop
|
||||
archive directory contains metadata (in the form of _index and _masterindex)
|
||||
and data (part-\*) files. The _index file contains the name of the files that
|
||||
are part of the archive and the location within the part files.
|
||||
|
||||
How to Create an Archive
|
||||
------------------------
|
||||
|
||||
`Usage: hadoop archive -archiveName name -p <parent> <src>* <dest>`
|
||||
|
||||
-archiveName is the name of the archive you would like to create. An example
|
||||
would be foo.har. The name should have a \*.har extension. The parent argument
|
||||
is to specify the relative path to which the files should be archived to.
|
||||
Example would be :
|
||||
|
||||
`-p /foo/bar a/b/c e/f/g`
|
||||
|
||||
Here /foo/bar is the parent path and a/b/c, e/f/g are relative paths to
|
||||
parent. Note that this is a Map/Reduce job that creates the archives. You
|
||||
would need a map reduce cluster to run this. For a detailed example the later
|
||||
sections.
|
||||
|
||||
If you just want to archive a single directory /foo/bar then you can just use
|
||||
|
||||
`hadoop archive -archiveName zoo.har -p /foo/bar /outputdir`
|
||||
|
||||
How to Look Up Files in Archives
|
||||
--------------------------------
|
||||
|
||||
The archive exposes itself as a file system layer. So all the fs shell
|
||||
commands in the archives work but with a different URI. Also, note that
|
||||
archives are immutable. So, rename's, deletes and creates return an error.
|
||||
URI for Hadoop Archives is
|
||||
|
||||
`har://scheme-hostname:port/archivepath/fileinarchive`
|
||||
|
||||
If no scheme is provided it assumes the underlying filesystem. In that case
|
||||
the URI would look like
|
||||
|
||||
`har:///archivepath/fileinarchive`
|
||||
|
||||
Archives Examples
|
||||
-----------------
|
||||
|
||||
$H3 Creating an Archive
|
||||
|
||||
`hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`
|
||||
|
||||
The above example is creating an archive using /user/hadoop as the relative
|
||||
archive directory. The directories /user/hadoop/dir1 and /user/hadoop/dir2
|
||||
will be archived in the following file system directory -- /user/zoo/foo.har.
|
||||
Archiving does not delete the input files. If you want to delete the input
|
||||
files after creating the archives (to reduce namespace), you will have to do
|
||||
it on your own.
|
||||
|
||||
$H3 Looking Up Files
|
||||
|
||||
Looking up files in hadoop archives is as easy as doing an ls on the
|
||||
filesystem. After you have archived the directories /user/hadoop/dir1 and
|
||||
/user/hadoop/dir2 as in the example above, to see all the files in the
|
||||
archives you can just run:
|
||||
|
||||
`hdfs dfs -ls -R har:///user/zoo/foo.har/`
|
||||
|
||||
To understand the significance of the -p argument, lets go through the above
|
||||
example again. If you just do an ls (not lsr) on the hadoop archive using
|
||||
|
||||
`hdfs dfs -ls har:///user/zoo/foo.har`
|
||||
|
||||
The output should be:
|
||||
|
||||
```
|
||||
har:///user/zoo/foo.har/dir1
|
||||
har:///user/zoo/foo.har/dir2
|
||||
```
|
||||
|
||||
As you can recall the archives were created with the following command
|
||||
|
||||
`hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`
|
||||
|
||||
If we were to change the command to:
|
||||
|
||||
`hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo`
|
||||
|
||||
then a ls on the hadoop archive using
|
||||
|
||||
`hdfs dfs -ls har:///user/zoo/foo.har`
|
||||
|
||||
would give you
|
||||
|
||||
```
|
||||
har:///user/zoo/foo.har/hadoop/dir1
|
||||
har:///user/zoo/foo.har/hadoop/dir2
|
||||
```
|
||||
|
||||
Notice that the archived files have been archived relative to /user/ rather
|
||||
than /user/hadoop.
|
||||
|
||||
Hadoop Archives and MapReduce
|
||||
-----------------------------
|
||||
|
||||
Using Hadoop Archives in MapReduce is as easy as specifying a different input
|
||||
filesystem than the default file system. If you have a hadoop archive stored
|
||||
in HDFS in /user/zoo/foo.har then for using this archive for MapReduce input,
|
||||
all you need to specify the input directory as har:///user/zoo/foo.har. Since
|
||||
Hadoop Archives is exposed as a file system MapReduce will be able to use all
|
||||
the logical input files in Hadoop Archives as input.
|
|
@ -92,6 +92,7 @@
|
|||
<item name="Encrypted Shuffle" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html"/>
|
||||
<item name="Pluggable Shuffle/Sort" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html"/>
|
||||
<item name="Distributed Cache Deploy" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html"/>
|
||||
<item name="Hadoop Archives" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html"/>
|
||||
<item name="DistCp" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistCp.html"/>
|
||||
</menu>
|
||||
|
||||
|
|
Loading…
Reference in New Issue