HDFS-5865. Update OfflineImageViewer document. Contributed by Akira Ajisaka.

git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/trunk@1590100 13f79535-47bb-0310-9956-ffa450edef68
2014-04-25 18:43:41 +00:00 · 2014-04-25 18:43:41 +00:00 · 445b742354
parent a059eadbe9
commit 445b742354
2 changed files with 101 additions and 325 deletions
--- a/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
+++ b/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
@@ -404,6 +404,8 @@ Release 2.5.0 - UNRELEASED
    HDFS-6276. Remove unnecessary conditions and null check. (suresh)
    HDFS-5865. Update OfflineImageViewer document. (Akira Ajisaka via wheat9)
 Release 2.4.1 - UNRELEASED
  INCOMPATIBLE CHANGES
--- a/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsImageViewer.apt.vm
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsImageViewer.apt.vm
@@ -23,56 +23,29 @@ Offline Image Viewer Guide
 * Overview
   The Offline Image Viewer is a tool to dump the contents of hdfs fsimage
-   files to human-readable formats in order to allow offline analysis and
+   files to a human-readable format and provide read-only WebHDFS API
-   examination of an Hadoop cluster's namespace. The tool is able to
+   in order to allow offline analysis and examination of a Hadoop cluster's
-   process very large image files relatively quickly, converting them to
+   namespace. The tool is able to process very large image files relatively
-   one of several output formats. The tool handles the layout formats that
+   quickly. The tool handles the layout formats that were included with Hadoop
-   were included with Hadoop versions 16 and up. If the tool is not able
+   versions 2.4 and up. If you want to handle older layout formats, you can
-   to process an image file, it will exit cleanly. The Offline Image
+   use the Offline Image Viewer of Hadoop 2.3.
-   Viewer does not require an Hadoop cluster to be running; it is entirely
+   If the tool is not able to process an image file, it will exit cleanly.
-   offline in its operation.
+   The Offline Image Viewer does not require a Hadoop cluster to be running;
   it is entirely offline in its operation.
   The Offline Image Viewer provides several output processors:
-   [[1]] Ls is the default output processor. It closely mimics the format of
+   [[1]] Web is the default output processor. It launches an HTTP server
-      the lsr command. It includes the same fields, in the same order, as
+      that exposes read-only WebHDFS API. Users can investigate the namespace
-      lsr : directory or file flag, permissions, replication, owner,
+      interactively by using HTTP REST API.
      group, file size, modification date, and full path. Unlike the lsr
      command, the root path is included. One important difference
      between the output of the lsr command and this processor is that this
      output is not sorted by directory name and contents. Rather, the
      files are listed in the order in which they are stored in the
      fsimage file. Therefore, it is not possible to directly compare the
      output of the lsr command with this tool. The Ls processor uses
      information contained within the Inode blocks to calculate file
      sizes and ignores the -skipBlocks option.
-   [[2]] Indented provides a more complete view of the fsimage's contents,
+   [[2]] XML creates an XML document of the fsimage and includes all of the
      including all of the information included in the image, such as
      image version, generation stamp and inode- and block-specific
      listings. This processor uses indentation to organize the output
      in a hierarchical manner. The lsr format is suitable for easy human
      comprehension.
   [[3]] Delimited provides one file per line consisting of the path,
      replication, modification time, access time, block size, number of
      blocks, file size, namespace quota, diskspace quota, permissions,
      username and group name. If run against an fsimage that does not
      contain any of these fields, the field's column will be included,
      but no data recorded. The default record delimiter is a tab, but
      this may be changed via the -delimiter command line argument. This
      processor is designed to create output that is easily analyzed by
      other tools, such as {{{http://pig.apache.org}Apache Pig}}. See
      the {{Analyzing Results}} section for further information on using
      this processor to analyze the contents of fsimage files.
   [[4]] XML creates an XML document of the fsimage and includes all of the
      information within the fsimage, similar to the lsr processor. The
      output of this processor is amenable to automated processing and
      analysis with XML tools. Due to the verbosity of the XML syntax,
      this processor will also generate the largest amount of output.
-   [[5]] FileDistribution is the tool for analyzing file sizes in the
+   [[3]] FileDistribution is the tool for analyzing file sizes in the
      namespace image. In order to run the tool one should define a range
      of integers [0, maxSize] by specifying maxSize and a step. The
      range of integers is divided into segments of size step: [0, s[1],
@@ -86,105 +59,93 @@ Offline Image Viewer Guide
 * Usage
-** Basic
+** Web Processor
-   The simplest usage of the Offline Image Viewer is to provide just an
+   The Web processor launches an HTTP server which exposes a read-only WebHDFS API.
-   input and output file, via the -i and -o command-line switches:
+   Users can specify the address to listen on with the -addr option (default:
   localhost:5978).
 ----
-   bash$ bin/hdfs oiv -i fsimage -o fsimage.txt
+   bash$ bin/hdfs oiv -i fsimage
   14/04/07 13:25:14 INFO offlineImageViewer.WebImageViewer: WebImageViewer
   started. Listening on /127.0.0.1:5978. Press Ctrl+C to stop the viewer.
 ----
-   This will create a file named fsimage.txt in the current directory
+   Users can access the viewer and get the information of the fsimage by
-   using the Ls output processor. For very large image files, this process
+   the following shell command:
   may take several minutes.
   One can specify which output processor via the command-line switch -p.
   For instance:
 ----
-   bash$ bin/hdfs oiv -i fsimage -o fsimage.xml -p XML
+   bash$ bin/hdfs dfs -ls webhdfs://127.0.0.1:5978/
   Found 2 items
   drwxrwx---   - root supergroup          0 2014-03-26 20:16 webhdfs://127.0.0.1:5978/tmp
   drwxr-xr-x   - root supergroup          0 2014-03-31 14:08 webhdfs://127.0.0.1:5978/user
 ----
-   or
+   To get the information of all the files and directories, you can simply use
   the following command:
 ----
-   bash$ bin/hdfs oiv -i fsimage -o fsimage.txt -p Indented
+   bash$ bin/hdfs dfs -ls -R webhdfs://127.0.0.1:5978/
 ----
-   This will run the tool using either the XML or Indented output
+   Users can also get JSON formatted FileStatuses via HTTP REST API.
   processor, respectively.
   One command-line option worth considering is -skipBlocks, which
   prevents the tool from explicitly enumerating all of the blocks that
   make up a file in the namespace. This is useful for file systems that
   have very large files. Enabling this option can significantly decrease
   the size of the resulting output, as individual blocks are not
   included. Note, however, that the Ls processor needs to enumerate the
   blocks and so overrides this option.
 Example
   Consider the following contrived namespace:
 ----
-   drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:17 /anotherDir
+   bash$ curl -i http://127.0.0.1:5978/webhdfs/v1/?op=liststatus
-   -rw-r--r--   3 theuser supergroup  286631664 2009-03-16 21:15 /anotherDir/biggerfile
+   HTTP/1.1 200 OK
-   -rw-r--r--   3 theuser supergroup       8754 2009-03-16 21:17 /anotherDir/smallFile
+   Content-Type: application/json
-   drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:11 /mapredsystem
+   Content-Length: 252
-   drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:11 /mapredsystem/theuser
+
-   drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:11 /mapredsystem/theuser/mapredsystem
+   {"FileStatuses":{"FileStatus":[
-   drwx-wx-wx   - theuser supergroup          0 2009-03-16 21:11 /mapredsystem/theuser/mapredsystem/ip.redacted.com
+   {"fileId":16386,"accessTime":0,"replication":0,"owner":"theuser","length":0,"permission":"755","blockSize":0,"modificationTime":1392772497282,"type":"DIRECTORY","group":"supergroup","childrenNum":1,"pathSuffix":"user"}
-   drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:12 /one
+   ]}}
   drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:12 /one/two
   drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:16 /user
   drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:19 /user/theuser
 ----
-   Applying the Offline Image Processor against this file with default
+   The Web processor now supports the following operations:
-   options would result in the following output:
+
   * {{{./WebHDFS.html#List_a_Directory}LISTSTATUS}}
   * {{{./WebHDFS.html#Status_of_a_FileDirectory}GETFILESTATUS}}
   * {{{./WebHDFS.html#Get_ACL_Status}GETACLSTATUS}}
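   As a rough illustration of consuming a LISTSTATUS response like the one in the curl example, the following Python sketch parses a FileStatuses JSON body and returns one tuple per entry. The helper name summarize_liststatus is hypothetical, and the embedded sample body is copied from the curl example in this guide; in practice the body would come from an HTTP GET against the viewer's address.

```python
import json

# Sample body copied from the curl example in this guide.
SAMPLE_BODY = '''{"FileStatuses":{"FileStatus":[
{"fileId":16386,"accessTime":0,"replication":0,"owner":"theuser","length":0,"permission":"755","blockSize":0,"modificationTime":1392772497282,"type":"DIRECTORY","group":"supergroup","childrenNum":1,"pathSuffix":"user"}
]}}'''


def summarize_liststatus(body):
    """Return (pathSuffix, type, owner, permission) tuples from a LISTSTATUS body.

    Hypothetical helper for illustration; it only assumes the JSON layout
    shown in the curl example.
    """
    statuses = json.loads(body)["FileStatuses"]["FileStatus"]
    return [(s["pathSuffix"], s["type"], s["owner"], s["permission"])
            for s in statuses]


if __name__ == "__main__":
    for name, kind, owner, perm in summarize_liststatus(SAMPLE_BODY):
        print(f"{kind:<10} {perm} {owner} /{name}")
```

   Keeping the parsing separate from the HTTP call makes the helper easy to exercise against saved responses without a running viewer.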
 ** XML Processor
   The XML processor dumps all the contents of the fsimage. Users can
   specify the input and output files via the -i and -o command-line options.
 ----
-   machine:hadoop-0.21.0-dev theuser$ bin/hdfs oiv -i fsimagedemo -o fsimage.txt
+   bash$ bin/hdfs oiv -p XML -i fsimage -o fsimage.xml
   drwxr-xr-x  -   theuser supergroup            0 2009-03-16 14:16 /
   drwxr-xr-x  -   theuser supergroup            0 2009-03-16 14:17 /anotherDir
   drwxr-xr-x  -   theuser supergroup            0 2009-03-16 14:11 /mapredsystem
   drwxr-xr-x  -   theuser supergroup            0 2009-03-16 14:12 /one
   drwxr-xr-x  -   theuser supergroup            0 2009-03-16 14:16 /user
   -rw-r--r--  3   theuser supergroup    286631664 2009-03-16 14:15 /anotherDir/biggerfile
   -rw-r--r--  3   theuser supergroup         8754 2009-03-16 14:17 /anotherDir/smallFile
   drwxr-xr-x  -   theuser supergroup            0 2009-03-16 14:11 /mapredsystem/theuser
   drwxr-xr-x  -   theuser supergroup            0 2009-03-16 14:11 /mapredsystem/theuser/mapredsystem
   drwx-wx-wx  -   theuser supergroup            0 2009-03-16 14:11 /mapredsystem/theuser/mapredsystem/ip.redacted.com
   drwxr-xr-x  -   theuser supergroup            0 2009-03-16 14:12 /one/two
   drwxr-xr-x  -   theuser supergroup            0 2009-03-16 14:19 /user/theuser
 ----
-   Similarly, applying the Indented processor would generate output that
+   This will create a file named fsimage.xml that contains all the information in
-   begins with:
+   the fsimage. For very large image files, this process may take several
   minutes.
   Applying the Offline Image Viewer with XML processor would result in the
   following output:
 ----
-   machine:hadoop-0.21.0-dev theuser$ bin/hdfs oiv -i fsimagedemo -p Indented -o fsimage.txt
+   <?xml version="1.0"?>
-
+   <fsimage>
-   FSImage
+   <NameSection>
-     ImageVersion = -19
+     <genstampV1>1000</genstampV1>
-     NamespaceID = 2109123098
+     <genstampV2>1002</genstampV2>
-     GenerationStamp = 1003
+     <genstampV1Limit>0</genstampV1Limit>
-     INodes [NumInodes = 12]
+     <lastAllocatedBlockId>1073741826</lastAllocatedBlockId>
-       Inode
+     <txid>37</txid>
-         INodePath =
+   </NameSection>
-         Replication = 0
+   <INodeSection>
-         ModificationTime = 2009-03-16 14:16
+     <lastInodeId>16400</lastInodeId>
-         AccessTime = 1969-12-31 16:00
+     <inode>
-         BlockSize = 0
+       <id>16385</id>
-         Blocks [NumBlocks = -1]
+       <type>DIRECTORY</type>
-         NSQuota = 2147483647
+       <name></name>
-         DSQuota = -1
+       <mtime>1392772497282</mtime>
-         Permissions
+       <permission>theuser:supergroup:rwxr-xr-x</permission>
-           Username = theuser
+       <nsquota>9223372036854775807</nsquota>
-           GroupName = supergroup
+       <dsquota>-1</dsquota>
-           PermString = rwxr-xr-x
+     </inode>
   ...remaining output omitted...
 ----
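   The XML output lends itself to scripted analysis. The Python sketch below, using only the standard library, extracts the inode entries from a document shaped like the XML processor output in this guide. The function name list_inodes is hypothetical, and the sample is a trimmed, hand-closed version of that output (the real file contains more sections and inodes); it is an assumption-laden illustration, not part of the tool.

```python
import xml.etree.ElementTree as ET

# Trimmed sample modeled on the XML processor output in this guide.
# The closing tags are added by hand so the snippet is well-formed.
SAMPLE_XML = """<?xml version="1.0"?>
<fsimage>
<INodeSection>
  <lastInodeId>16400</lastInodeId>
  <inode>
    <id>16385</id>
    <type>DIRECTORY</type>
    <name></name>
    <mtime>1392772497282</mtime>
    <permission>theuser:supergroup:rwxr-xr-x</permission>
    <nsquota>9223372036854775807</nsquota>
    <dsquota>-1</dsquota>
  </inode>
</INodeSection>
</fsimage>
"""


def list_inodes(xml_text):
    """Return one dict per <inode> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rows = []
    for inode in root.iter("inode"):
        # The permission field is "owner:group:mode" in the sample output.
        owner, group, mode = inode.findtext("permission").split(":")
        rows.append({
            "id": int(inode.findtext("id")),
            "type": inode.findtext("type"),
            "name": inode.findtext("name") or "",  # the root inode has an empty name
            "owner": owner,
            "group": group,
            "mode": mode,
        })
    return rows


if __name__ == "__main__":
    for row in list_inodes(SAMPLE_XML):
        print(row)
```

   For very large images, loading the whole document with ET.fromstring may use too much memory; a streaming parser such as ET.iterparse is likely a better fit there.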
@@ -193,30 +154,32 @@ Example
 *-----------------------:-----------------------------------+
 | <<Flag>>              | <<Description>>                   |
 *-----------------------:-----------------------------------+
-| <<<-i>>>\|<<<--inputFile>>> <input file> | Specify the input fsimage file to
+| <<<-i>>>\|<<<--inputFile>>> <input file> | Specify the input fsimage file
-|                       | process. Required.
+|                       | to process. Required.
 *-----------------------:-----------------------------------+
-| <<<-o>>>\|<<<--outputFile>>> <output file> | Specify the output filename, if the
+| <<<-o>>>\|<<<--outputFile>>> <output file> | Specify the output filename,
-|                       | specified output processor generates one. If the specified file already
+|                       | if the specified output processor generates one. If
-|                       | exists, it is silently overwritten. Required.
+|                       | the specified file already exists, it is silently
 |                       | overwritten. (output to stdout by default)
 *-----------------------:-----------------------------------+
-| <<<-p>>>\|<<<--processor>>> <processor> | Specify the image processor to apply
+| <<<-p>>>\|<<<--processor>>> <processor> | Specify the image processor to
-|                       | against the image file. Currently valid options are Ls (default), XML
+|                       | apply against the image file. Currently valid options
-|                       | and Indented..
+|                       | are Web (default), XML and FileDistribution.
 *-----------------------:-----------------------------------+
-| <<<-skipBlocks>>>     | Do not enumerate individual blocks within files. This may
+| <<<-addr>>> <address> | Specify the address (host:port) to listen on.
-|                       | save processing time and outfile file space on namespaces with very
+|                       | (localhost:5978 by default). This option is used with
-|                       | large files. The Ls processor reads the blocks to correctly determine
+|                       | Web processor.
 |                       | file sizes and ignores this option.
 *-----------------------:-----------------------------------+
-| <<<-printToScreen>>>  | Pipe output of processor to console as well as specified
+| <<<-maxSize>>> <size> | Specify the range [0, maxSize] of file sizes to be
-|                       | file. On extremely large namespaces, this may increase processing time
+|                       | analyzed in bytes (128GB by default). This option is
-|                       | by an order of magnitude.
+|                       | used with FileDistribution processor.
 *-----------------------:-----------------------------------+
-| <<<-delimiter>>> <arg>| When used in conjunction with the Delimited processor,
+| <<<-step>>> <size>    | Specify the granularity of the distribution in bytes
-|                       | replaces the default tab delimiter with the string specified by arg.
+|                       | (2MB by default). This option is used with
 |                       | FileDistribution processor.
 *-----------------------:-----------------------------------+
-| <<<-h>>>\|<<<--help>>>| Display the tool usage and help information and exit.
+| <<<-h>>>\|<<<--help>>>| Display the tool usage and help information and
 |                       | exit.
 *-----------------------:-----------------------------------+
 * Analyzing Results
@@ -224,193 +187,4 @@ Example
   The Offline Image Viewer makes it easy to gather large amounts of data
   about the hdfs namespace. This information can then be used to explore
   file system usage patterns or find specific files that match arbitrary
-   criteria, along with other types of namespace analysis. The Delimited
+   criteria, along with other types of namespace analysis.
   image processor in particular creates output that is amenable to
   further processing by tools such as Apache Pig. Pig provides a
   particularly good choice for analyzing these data as it is able to deal
   with the output generated from a small fsimage but also scales up to
   consume data from extremely large file systems.
   The Delimited image processor generates lines of text separated, by
   default, by tabs and includes all of the fields that are common between
   constructed files and files that were still under construction when the
   fsimage was generated. Example scripts are provided demonstrating how
   to use this output to accomplish three tasks: determine the number of
   files each user has created on the file system, find files that were created
   but have not been accessed, and find probable duplicates of large files by
   comparing the size of each file.
   Each of the following scripts assumes you have generated an output file
   using the Delimited processor named foo and will be storing the results
   of the Pig analysis in a file named results.
 ** Total Number of Files for Each User
   This script processes each path within the namespace, groups them by
   the file owner and determines the total number of files each user owns.
 ----
      numFilesOfEachUser.pig:
   -- This script determines the total number of files each user has in
   -- the namespace. Its output is of the form:
   --   username, totalNumFiles
   -- Load all of the fields from the file
   A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
                                                    replication:int,
                                                    modTime:chararray,
                                                    accessTime:chararray,
                                                    blockSize:long,
                                                    numBlocks:int,
                                                    fileSize:long,
                                                    NamespaceQuota:int,
                                                    DiskspaceQuota:int,
                                                    perms:chararray,
                                                    username:chararray,
                                                    groupname:chararray);
   -- Grab just the path and username
   B = FOREACH A GENERATE path, username;
   -- Generate the sum of the number of paths for each user
   C = FOREACH (GROUP B BY username) GENERATE group, COUNT(B.path);
   -- Save results
   STORE C INTO '$outputFile';
 ----
   This script can be run against pig with the following command:
 ----
   bin/pig -x local -param inputFile=../foo -param outputFile=../results ../numFilesOfEachUser.pig
 ----
   The output file's content will be similar to that below:
 ----
   bart 1
   lisa 16
   homer 28
   marge 2456
 ----
 ** Files That Have Never Been Accessed
   This script finds files that were created but whose access times were
   never changed, meaning they were never opened or viewed.
 ----
      neverAccessed.pig:
   -- This script generates a list of files that were created but never
   -- accessed, based on their AccessTime
   -- Load all of the fields from the file
   A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
                                                    replication:int,
                                                    modTime:chararray,
                                                    accessTime:chararray,
                                                    blockSize:long,
                                                    numBlocks:int,
                                                    fileSize:long,
                                                    NamespaceQuota:int,
                                                    DiskspaceQuota:int,
                                                    perms:chararray,
                                                    username:chararray,
                                                    groupname:chararray);
   -- Grab just the path and last time the file was accessed
   B = FOREACH A GENERATE path, accessTime;
   -- Drop all the paths that don't have the default assigned last-access time
   C = FILTER B BY accessTime == '1969-12-31 16:00';
   -- Drop the accessTimes, since they're all the same
   D = FOREACH C GENERATE path;
   -- Save results
   STORE D INTO '$outputFile';
 ----
   This script can be run against pig with the following command and its
   output file's content will be a list of files that were created but
   never viewed afterwards.
 ----
   bin/pig -x local -param inputFile=../foo -param outputFile=../results ../neverAccessed.pig
 ----
 ** Probable Duplicated Files Based on File Size
   This script groups files together based on their size, drops any that
   are smaller than 100 MB and returns a list of the file size, number of
   files found and a tuple of the file paths. This can be used to find
   likely duplicates within the filesystem namespace.
 ----
      probableDuplicates.pig:
   -- This script finds probable duplicate files greater than 100 MB by
   -- grouping together files based on their byte size. Files of this size
   -- with exactly the same number of bytes can be considered probable
   -- duplicates, but should be checked further, either by comparing the
   -- contents directly or by another proxy, such as a hash of the contents.
   -- The script's output is of the form:
   --    fileSize numProbableDuplicates {(probableDup1), (probableDup2)}
   -- Load all of the fields from the file
   A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
                                                    replication:int,
                                                    modTime:chararray,
                                                    accessTime:chararray,
                                                    blockSize:long,
                                                    numBlocks:int,
                                                    fileSize:long,
                                                    NamespaceQuota:int,
                                                    DiskspaceQuota:int,
                                                    perms:chararray,
                                                    username:chararray,
                                                    groupname:chararray);
   -- Grab the pathname and filesize
   B = FOREACH A generate path, fileSize;
   -- Drop files smaller than 100 MB
   C = FILTER B by fileSize > 100L  * 1024L * 1024L;
   -- Gather all the files of the same byte size
   D = GROUP C by fileSize;
   -- Generate path, num of duplicates, list of duplicates
   E = FOREACH D generate group AS fileSize, COUNT(C) as numDupes, C.path AS files;
   -- Drop all the sizes for which there is only one file
   F = FILTER E by numDupes > 1L;
   -- Sort by the size of the files
   G = ORDER F by fileSize;
   -- Save results
   STORE G INTO '$outputFile';
 ----
   This script can be run against pig with the following command:
 ----
   bin/pig -x local -param inputFile=../foo -param outputFile=../results ../probableDuplicates.pig
 ----
   The output file's content will be similar to that below:
 ----
   1077288632 2 {(/user/tennant/work1/part-00501),(/user/tennant/work1/part-00993)}
   1077288664 4 {(/user/tennant/work0/part-00567),(/user/tennant/work0/part-03980),(/user/tennant/work1/part-00725),(/user/eccelston/output/part-03395)}
   1077288668 3 {(/user/tennant/work0/part-03705),(/user/tennant/work0/part-04242),(/user/tennant/work1/part-03839)}
   1077288698 2 {(/user/tennant/work0/part-00435),(/user/eccelston/output/part-01382)}
   1077288702 2 {(/user/tennant/work0/part-03864),(/user/eccelston/output/part-03234)}
 ----
   Each line includes the file size in bytes that was found to be
   duplicated, the number of duplicates found, and a list of the
   duplicated paths. Files less than 100MB are ignored, providing a
   reasonable likelihood that files of these exact sizes may be
   duplicates.
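   The same grouping can be sketched outside Pig. The Python snippet below is a minimal illustration, assuming tab-separated Delimited output with the field order given in the Pig LOAD statements above (path first, fileSize seventh). The function name probable_duplicates and the sample rows are hypothetical and made up for the example.

```python
from collections import defaultdict

FILE_SIZE_COLUMN = 6            # fileSize position in the Delimited field order
MIN_SIZE = 100 * 1024 * 1024    # ignore files that are not larger than 100 MB


def probable_duplicates(lines, min_size=MIN_SIZE):
    """Group paths by exact byte size and keep sizes shared by 2+ files."""
    by_size = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip rows without a usable numeric file size.
        if len(fields) <= FILE_SIZE_COLUMN or not fields[FILE_SIZE_COLUMN].isdigit():
            continue
        size = int(fields[FILE_SIZE_COLUMN])
        if size > min_size:
            by_size[size].append(fields[0])
    # Sort by file size, mirroring the ORDER step in the Pig script.
    return sorted((size, paths) for size, paths in by_size.items() if len(paths) > 1)


def make_row(path, size):
    """Build a made-up Delimited row; every value except path and size is filler."""
    return "\t".join([path, "3", "2009-03-16 21:15", "2009-03-16 21:15",
                      "134217728", "9", str(size), "-1", "-1",
                      "rw-r--r--", "theuser", "supergroup"])


SAMPLE_ROWS = [
    make_row("/user/theuser/a", 1077288632),
    make_row("/user/theuser/b", 1077288632),
    make_row("/user/theuser/small", 8754),
]

if __name__ == "__main__":
    for size, paths in probable_duplicates(SAMPLE_ROWS):
        print(size, len(paths), paths)
```

   As with the Pig script, matching sizes only suggest duplicates; the candidates should still be confirmed by comparing contents or a hash of the contents.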