HADOOP-10908. Common needs updates for shell rewrite (aw)

Allen Wittenauer 2015-01-05 14:26:41 -08:00
parent 41d72cbd48
commit 94d342e607
5 changed files with 558 additions and 489 deletions

@ -344,6 +344,8 @@ Trunk (Unreleased)
HADOOP-11397. Can't override HADOOP_IDENT_STRING (Kengo Seki via aw)
HADOOP-10908. Common needs updates for shell rewrite (aw)
OPTIMIZATIONS
HADOOP-7761. Improve the performance of raw comparisons. (todd)

@ -11,83 +11,81 @@
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop Map Reduce Next Generation-${project.version} - Cluster Setup
Hadoop ${project.version} - Cluster Setup
---
---
${maven.build.timestamp}
%{toc|section=1|fromDepth=0}
Hadoop MapReduce Next Generation - Cluster Setup
Hadoop Cluster Setup
* {Purpose}
This document describes how to install, configure and manage non-trivial
This document describes how to install and configure
Hadoop clusters ranging from a few nodes to extremely large clusters
with thousands of nodes.
with thousands of nodes. To play with Hadoop, you may first want to
install it on a single machine (see {{{./SingleCluster.html}Single Node Setup}}).
To play with Hadoop, you may first want to install it on a single
machine (see {{{./SingleCluster.html}Single Node Setup}}).
This document does not cover advanced topics such as {{{./SecureMode.html}Security}} or
High Availability.
* {Prerequisites}
Download a stable version of Hadoop from Apache mirrors.
* Install Java. See the {{{http://wiki.apache.org/hadoop/HadoopJavaVersions}Hadoop Wiki}} for known good versions.
* Download a stable version of Hadoop from Apache mirrors.
* {Installation}
Installing a Hadoop cluster typically involves unpacking the software on all
the machines in the cluster or installing RPMs.
the machines in the cluster or installing it via a packaging system as
appropriate for your operating system. It is important to divide up the hardware
into functions.
Typically one machine in the cluster is designated as the NameNode and
another machine as the ResourceManager, exclusively. These are the masters.
another machine as the ResourceManager, exclusively. These are the masters. Other
services (such as Web App Proxy Server and MapReduce Job History server) are usually
run either on dedicated hardware or on shared infrastructure, depending upon the load.
The rest of the machines in the cluster act as both DataNode and NodeManager.
These are the slaves.
* {Running Hadoop in Non-Secure Mode}
* {Configuring Hadoop in Non-Secure Mode}
The following sections describe how to configure a Hadoop cluster.
{Configuration Files}
Hadoop configuration is driven by two types of important configuration files:
Hadoop's Java configuration is driven by two types of important configuration files:
* Read-only default configuration - <<<core-default.xml>>>,
<<<hdfs-default.xml>>>, <<<yarn-default.xml>>> and
<<<mapred-default.xml>>>.
* Site-specific configuration - <<conf/core-site.xml>>,
<<conf/hdfs-site.xml>>, <<conf/yarn-site.xml>> and
<<conf/mapred-site.xml>>.
* Site-specific configuration - <<<etc/hadoop/core-site.xml>>>,
<<<etc/hadoop/hdfs-site.xml>>>, <<<etc/hadoop/yarn-site.xml>>> and
<<<etc/hadoop/mapred-site.xml>>>.
Additionally, you can control the Hadoop scripts found in the bin/
directory of the distribution, by setting site-specific values via the
<<conf/hadoop-env.sh>> and <<yarn-env.sh>>.
{Site Configuration}
Additionally, you can control the Hadoop scripts found in the bin/
directory of the distribution, by setting site-specific values via the
<<<etc/hadoop/hadoop-env.sh>>> and <<<etc/hadoop/yarn-env.sh>>>.
To configure the Hadoop cluster you will need to configure the
<<<environment>>> in which the Hadoop daemons execute as well as the
<<<configuration parameters>>> for the Hadoop daemons.
The Hadoop daemons are NameNode/DataNode and ResourceManager/NodeManager.
HDFS daemons are NameNode, SecondaryNameNode, and DataNode. YARN daemons
are ResourceManager, NodeManager, and WebAppProxy. If MapReduce is to be
used, then the MapReduce Job History Server will also be running. For
large installations, these are generally running on separate hosts.
** {Configuring Environment of Hadoop Daemons}
Administrators should use the <<conf/hadoop-env.sh>> and
<<conf/yarn-env.sh>> script to do site-specific customization of the
Hadoop daemons' process environment.
Administrators should use the <<<etc/hadoop/hadoop-env.sh>>> and optionally the
<<<etc/hadoop/mapred-env.sh>>> and <<<etc/hadoop/yarn-env.sh>>> scripts to do
site-specific customization of the Hadoop daemons' process environment.
At the very least you should specify the <<<JAVA_HOME>>> so that it is
At the very least, you must specify the <<<JAVA_HOME>>> so that it is
correctly defined on each remote node.
In most cases you should also specify <<<HADOOP_PID_DIR>>> and
<<<HADOOP_SECURE_DN_PID_DIR>>> to point to directories that can only be
written to by the users that are going to run the hadoop daemons.
Otherwise there is the potential for a symlink attack.
Administrators can configure individual daemons using the configuration
options shown below in the table:
@ -114,20 +112,42 @@ Hadoop MapReduce Next Generation - Cluster Setup
statement should be added in hadoop-env.sh :
----
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"
export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC"
----
See <<<etc/hadoop/hadoop-env.sh>>> for other examples.
Other useful configuration parameters that you can customize include:
* <<<HADOOP_LOG_DIR>>> / <<<YARN_LOG_DIR>>> - The directory where the
daemons' log files are stored. They are automatically created if they
don't exist.
* <<<HADOOP_PID_DIR>>> - The directory where the
daemons' process id files are stored.
* <<<HADOOP_HEAPSIZE>>> / <<<YARN_HEAPSIZE>>> - The maximum amount of
heapsize to use, in MB e.g. if the variable is set to 1000 the heap
will be set to 1000MB. This is used to configure the heap
size for the daemon. By default, the value is 1000. If you want to
configure the values separately for each daemon you can use.
* <<<HADOOP_LOG_DIR>>> - The directory where the
daemons' log files are stored. Log files are automatically created
if they don't exist.
* <<<HADOOP_HEAPSIZE_MAX>>> - The maximum amount of
memory to use for the Java heapsize. Units supported by the JVM
are also supported here. If no unit is present, it will be assumed
the number is in megabytes. By default, Hadoop will let the JVM
determine how much to use. This value can be overridden on
a per-daemon basis using the appropriate <<<_OPTS>>> variable listed above.
For example, setting <<<HADOOP_HEAPSIZE_MAX=1g>>> and
<<<HADOOP_NAMENODE_OPTS="-Xmx5g">>> will configure the NameNode with 5GB heap.
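The unit rule described for <<<HADOOP_HEAPSIZE_MAX>>> can be illustrated with a small sketch. The function name <<<heap_to_xmx>>> is hypothetical and is not part of the Hadoop scripts; it only shows the stated behavior (a bare number is taken as megabytes, a value with a unit is passed through, an empty value leaves the choice to the JVM).

```shell
# Hypothetical helper, NOT the Hadoop implementation: turn a
# HADOOP_HEAPSIZE_MAX-style value into a JVM -Xmx flag.
heap_to_xmx() {
  value="$1"
  case "${value}" in
    '') return 0 ;;            # nothing set: emit no flag, the JVM decides
    *[!0-9]*) ;;               # already carries a unit, e.g. 1g or 512m
    *) value="${value}m" ;;    # digits only: assume megabytes
  esac
  printf '%s\n' "-Xmx${value}"
}

heap_to_xmx 1000
heap_to_xmx 1g
```

A per-daemon <<<_OPTS>>> variable that carries its own <<<-Xmx>>> would still take precedence, as described above.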
In most cases, you should specify the <<<HADOOP_PID_DIR>>> and
<<<HADOOP_LOG_DIR>>> directories such that they can only be
written to by the users that are going to run the hadoop daemons.
Otherwise there is the potential for a symlink attack.
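As a minimal sketch of the advice above, the following creates the two directories with restrictive permissions and then checks that neither is writable by group or others. The scratch location and the 0750 mode are assumptions chosen for illustration; a real installation would use its own paths and ownership.

```shell
# Scratch location for illustration only; real paths are site-specific.
base="$(mktemp -d)"
HADOOP_PID_DIR="${base}/pid"
HADOOP_LOG_DIR="${base}/log"
export HADOOP_PID_DIR HADOOP_LOG_DIR

for dir in "${HADOOP_PID_DIR}" "${HADOOP_LOG_DIR}"; do
  mkdir -p "${dir}"
  chmod 0750 "${dir}"    # writable by the owning account only
done

# List any of the two directories that is group- or world-writable.
unsafe="$(find "${HADOOP_PID_DIR}" "${HADOOP_LOG_DIR}" -prune \
  \( -perm -020 -o -perm -002 \) -print)"

if [ -n "${unsafe}" ]; then
  echo "writable by other accounts: ${unsafe}"
fi
```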
It is also traditional to configure <<<HADOOP_PREFIX>>> in the system-wide
shell environment configuration. For example, a simple script inside
<<</etc/profile.d>>>:
---
HADOOP_PREFIX=/path/to/hadoop
export HADOOP_PREFIX
---
*--------------------------------------+--------------------------------------+
|| Daemon || Environment Variable |
@ -141,12 +161,12 @@ Hadoop MapReduce Next Generation - Cluster Setup
| Map Reduce Job History Server | HADOOP_JOB_HISTORYSERVER_HEAPSIZE |
*--------------------------------------+--------------------------------------+
** {Configuring the Hadoop Daemons in Non-Secure Mode}
** {Configuring the Hadoop Daemons}
This section deals with important parameters to be specified in
the given configuration files:
* <<<conf/core-site.xml>>>
* <<<etc/hadoop/core-site.xml>>>
*-------------------------+-------------------------+------------------------+
|| Parameter || Value || Notes |
@ -157,7 +177,7 @@ Hadoop MapReduce Next Generation - Cluster Setup
| | | Size of read/write buffer used in SequenceFiles. |
*-------------------------+-------------------------+------------------------+
* <<<conf/hdfs-site.xml>>>
* <<<etc/hadoop/hdfs-site.xml>>>
* Configurations for NameNode:
@ -195,7 +215,7 @@ Hadoop MapReduce Next Generation - Cluster Setup
| | | stored in all named directories, typically on different devices. |
*-------------------------+-------------------------+------------------------+
* <<<conf/yarn-site.xml>>>
* <<<etc/hadoop/yarn-site.xml>>>
* Configurations for ResourceManager and NodeManager:
@ -341,9 +361,7 @@ Hadoop MapReduce Next Generation - Cluster Setup
| | | Be careful, set this too small and you will spam the name node. |
*-------------------------+-------------------------+------------------------+
* <<<conf/mapred-site.xml>>>
* <<<etc/hadoop/mapred-site.xml>>>
* Configurations for MapReduce Applications:
@ -395,22 +413,6 @@ Hadoop MapReduce Next Generation - Cluster Setup
| | | Directory where history files are managed by the MR JobHistory Server. |
*-------------------------+-------------------------+------------------------+
* {Hadoop Rack Awareness}
The HDFS and the YARN components are rack-aware.
The NameNode and the ResourceManager obtain the rack information of the
slaves in the cluster by invoking an API <resolve> in an administrator
configured module.
The API resolves the DNS name (also IP address) to a rack id.
The site-specific module to use can be configured using the configuration
item <<<topology.node.switch.mapping.impl>>>. The default implementation
of the same runs a script/command configured using
<<<topology.script.file.name>>>. If <<<topology.script.file.name>>> is
not set, the rack id </default-rack> is returned for any passed IP address.
* {Monitoring Health of NodeManagers}
Hadoop provides a mechanism by which administrators can configure the
@ -433,7 +435,7 @@ Hadoop MapReduce Next Generation - Cluster Setup
node was healthy is also displayed on the web interface.
The following parameters can be used to control the node health
monitoring script in <<<conf/yarn-site.xml>>>.
monitoring script in <<<etc/hadoop/yarn-site.xml>>>.
*-------------------------+-------------------------+------------------------+
|| Parameter || Value || Notes |
@ -465,165 +467,87 @@ Hadoop MapReduce Next Generation - Cluster Setup
disk is either raided or a failure in the boot disk is identified by the
health checker script.
* {Slaves file}
* {Slaves File}
Typically you choose one machine in the cluster to act as the NameNode and
one machine to act as the ResourceManager, exclusively. The rest of the
machines act as both a DataNode and NodeManager and are referred to as
<slaves>.
List all slave hostnames or IP addresses in your <<<etc/hadoop/slaves>>>
file, one per line. Helper scripts (described below) will use the
<<<etc/hadoop/slaves>>> file to run commands on many hosts at once. It is not
used for any of the Java-based Hadoop configuration. In order
to use this functionality, ssh trusts (via either passphraseless ssh or
some other means, such as Kerberos) must be established for the accounts
used to run Hadoop.
List all slave hostnames or IP addresses in your <<<conf/slaves>>> file,
one per line.
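To make the one-host-per-line format concrete, here is a sketch that reads such a file and prints the command a helper might run on each host. The hostnames are invented, the comment and blank-line handling is an assumption about how the file is read, and nothing is actually executed over ssh.

```shell
# Scratch copy of a slaves file; hostnames are made up for illustration.
slaves_file="$(mktemp)"
cat > "${slaves_file}" <<'EOF'
# worker nodes
worker01.example.com
worker02.example.com

10.0.0.13
EOF

# Drop comments and blank lines (assumed behavior), then list the hosts.
hosts="$(sed -e 's/#.*$//' -e '/^[[:space:]]*$/d' "${slaves_file}")"

for host in ${hosts}; do
  # Print only; a real helper would run this through ssh.
  echo "ssh ${host} hdfs --daemon start datanode"
done

host_count="$(printf '%s\n' "${hosts}" | wc -l | tr -d ' ')"
rm -f "${slaves_file}"
```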
* {Hadoop Rack Awareness}
Many Hadoop components are rack-aware and take advantage of the
network topology for performance and safety. Hadoop daemons obtain the
rack information of the slaves in the cluster by invoking an administrator
configured module. See the {{{./RackAwareness.html}Rack Awareness}}
documentation for more specific information.
It is highly recommended to configure rack awareness prior to starting HDFS.
* {Logging}
Hadoop uses the Apache log4j via the Apache Commons Logging framework for
logging. Edit the <<<conf/log4j.properties>>> file to customize the
Hadoop uses the {{{http://logging.apache.org/log4j/2.x/}Apache log4j}} via the Apache Commons Logging framework for
logging. Edit the <<<etc/hadoop/log4j.properties>>> file to customize the
Hadoop daemons' logging configuration (log-formats and so on).
* {Operating the Hadoop Cluster}
Once all the necessary configuration is complete, distribute the files to the
<<<HADOOP_CONF_DIR>>> directory on all the machines.
<<<HADOOP_CONF_DIR>>> directory on all the machines. This should be the
same directory on all machines.
** Hadoop Startup
To start a Hadoop cluster you will need to start both the HDFS and YARN
cluster.
Format a new distributed filesystem:
----
$ $HADOOP_PREFIX/bin/hdfs namenode -format <cluster_name>
----
Start the HDFS with the following command, run on the designated NameNode:
----
$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode
----
Run a script to start DataNodes on all slaves:
----
$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode
----
Start the YARN with the following command, run on the designated
ResourceManager:
----
$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager
----
Run a script to start NodeManagers on all slaves:
----
$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager
----
Start a standalone WebAppProxy server. If multiple servers
are used with load balancing it should be run on each of them:
----
$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh start proxyserver --config $HADOOP_CONF_DIR
----
Start the MapReduce JobHistory Server with the following command, run on the
designated server:
----
$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh start historyserver --config $HADOOP_CONF_DIR
----
** Hadoop Shutdown
Stop the NameNode with the following command, run on the designated
NameNode:
----
$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop namenode
----
Run a script to stop DataNodes on all slaves:
----
$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop datanode
----
Stop the ResourceManager with the following command, run on the designated
ResourceManager:
----
$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop resourcemanager
----
Run a script to stop NodeManagers on all slaves:
----
$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop nodemanager
----
Stop the WebAppProxy server. If multiple servers are used with load
balancing it should be run on each of them:
----
$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh stop proxyserver --config $HADOOP_CONF_DIR
----
Stop the MapReduce JobHistory Server with the following command, run on the
designated server:
----
$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh stop historyserver --config $HADOOP_CONF_DIR
----
* {Operating the Hadoop Cluster}
Once all the necessary configuration is complete, distribute the files to the
<<<HADOOP_CONF_DIR>>> directory on all the machines.
This section also describes the various Unix users who should be starting the
various components and uses the same Unix accounts and groups used previously:
In general, it is recommended that HDFS and YARN run as separate users.
In the majority of installations, HDFS processes execute as 'hdfs'. YARN
is typically using the 'yarn' account.
** Hadoop Startup
To start a Hadoop cluster you will need to start both the HDFS and YARN
cluster.
Format a new distributed filesystem as <hdfs>:
The first time you bring up HDFS, it must be formatted. Format a new
distributed filesystem as <hdfs>:
----
[hdfs]$ $HADOOP_PREFIX/bin/hdfs namenode -format <cluster_name>
----
Start the HDFS with the following command, run on the designated NameNode
as <hdfs>:
Start the HDFS NameNode with the following command on the
designated node as <hdfs>:
----
[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start namenode
[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon start namenode
----
Run a script to start DataNodes on all slaves as <root> with a special
environment variable <<<HADOOP_SECURE_DN_USER>>> set to <hdfs>:
Start an HDFS DataNode with the following command on each
designated node as <hdfs>:
----
[root]$ HADOOP_SECURE_DN_USER=hdfs $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs start datanode
[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon start datanode
----
If <<<etc/hadoop/slaves>>> and ssh trusted access are configured
(see {{{./SingleCluster.html}Single Node Setup}}), all of the
HDFS processes can be started with a utility script. As <hdfs>:
----
[hdfs]$ $HADOOP_PREFIX/sbin/start-dfs.sh
----
Start the YARN with the following command, run on the designated
ResourceManager as <yarn>:
----
[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start resourcemanager
[yarn]$ $HADOOP_PREFIX/bin/yarn --daemon start resourcemanager
----
Run a script to start NodeManagers on all slaves as <yarn>:
Run a script to start a NodeManager on each designated host as <yarn>:
----
[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR start nodemanager
[yarn]$ $HADOOP_PREFIX/bin/yarn --daemon start nodemanager
----
Start a standalone WebAppProxy server. Run on the WebAppProxy
@ -631,14 +555,22 @@ $ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh stop historyserver --config $HADOO
it should be run on each of them:
----
[yarn]$ $HADOOP_YARN_HOME/bin/yarn start proxyserver --config $HADOOP_CONF_DIR
[yarn]$ $HADOOP_PREFIX/bin/yarn --daemon start proxyserver
----
Start the MapReduce JobHistory Server with the following command, run on the
designated server as <mapred>:
If <<<etc/hadoop/slaves>>> and ssh trusted access are configured
(see {{{./SingleCluster.html}Single Node Setup}}), all of the
YARN processes can be started with a utility script. As <yarn>:
----
[mapred]$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh start historyserver --config $HADOOP_CONF_DIR
[yarn]$ $HADOOP_PREFIX/sbin/start-yarn.sh
----
Start the MapReduce JobHistory Server with the following command, run
on the designated server as <mapred>:
----
[mapred]$ $HADOOP_PREFIX/bin/mapred --daemon start historyserver
----
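The per-daemon start commands above follow one pattern: an account, a shell command, and a daemon name. The dry-run sketch below prints them in the order given in this section without starting anything. The function name <<<start_plan>>> and the placeholder default for <<<HADOOP_PREFIX>>> are assumptions for illustration.

```shell
# Dry run only: prints the start commands, does not execute them.
HADOOP_PREFIX="${HADOOP_PREFIX:-/path/to/hadoop}"

start_plan() {
  # Each input line: account, shell command, daemon name.
  while read -r account cmd daemon; do
    echo "[${account}]\$ ${HADOOP_PREFIX}/bin/${cmd} --daemon start ${daemon}"
  done <<'EOF'
hdfs hdfs namenode
hdfs hdfs datanode
yarn yarn resourcemanager
yarn yarn nodemanager
yarn yarn proxyserver
mapred mapred historyserver
EOF
}

start_plan
```

Replacing <<<start>>> with <<<stop>>> in the printed command gives the matching shutdown form.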
** Hadoop Shutdown
@ -647,26 +579,42 @@ $ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh stop historyserver --config $HADOO
as <hdfs>:
----
[hdfs]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop namenode
[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon stop namenode
----
Run a script to stop DataNodes on all slaves as <root>:
Run a script to stop a DataNode as <hdfs>:
----
[root]$ $HADOOP_PREFIX/sbin/hadoop-daemon.sh --config $HADOOP_CONF_DIR --script hdfs stop datanode
[hdfs]$ $HADOOP_PREFIX/bin/hdfs --daemon stop datanode
----
If <<<etc/hadoop/slaves>>> and ssh trusted access are configured
(see {{{./SingleCluster.html}Single Node Setup}}), all of the
HDFS processes may be stopped with a utility script. As <hdfs>:
----
[hdfs]$ $HADOOP_PREFIX/sbin/stop-dfs.sh
----
Stop the ResourceManager with the following command, run on the designated
ResourceManager as <yarn>:
----
[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop resourcemanager
[yarn]$ $HADOOP_PREFIX/bin/yarn --daemon stop resourcemanager
----
Run a script to stop NodeManagers on all slaves as <yarn>:
Run a script to stop a NodeManager on a slave as <yarn>:
----
[yarn]$ $HADOOP_YARN_HOME/sbin/yarn-daemon.sh --config $HADOOP_CONF_DIR stop nodemanager
[yarn]$ $HADOOP_PREFIX/bin/yarn --daemon stop nodemanager
----
If <<<etc/hadoop/slaves>>> and ssh trusted access are configured
(see {{{./SingleCluster.html}Single Node Setup}}), all of the
YARN processes can be stopped with a utility script. As <yarn>:
----
[yarn]$ $HADOOP_PREFIX/sbin/stop-yarn.sh
----
Stop the WebAppProxy server. Run on the WebAppProxy server as
@ -674,14 +622,14 @@ $ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh stop historyserver --config $HADOO
should be run on each of them:
----
[yarn]$ $HADOOP_YARN_HOME/bin/yarn stop proxyserver --config $HADOOP_CONF_DIR
[yarn]$ $HADOOP_PREFIX/bin/yarn --daemon stop proxyserver
----
Stop the MapReduce JobHistory Server with the following command, run on the
designated server as <mapred>:
----
[mapred]$ $HADOOP_PREFIX/sbin/mr-jobhistory-daemon.sh stop historyserver --config $HADOOP_CONF_DIR
[mapred]$ $HADOOP_PREFIX/bin/mapred --daemon stop historyserver
----
* {Web Interfaces}

@ -21,102 +21,161 @@
%{toc}
Overview
Hadoop Commands Guide
All hadoop commands are invoked by the <<<bin/hadoop>>> script. Running the
hadoop script without any arguments prints the description for all
commands.
* Overview
Usage: <<<hadoop [--config confdir] [--loglevel loglevel] [COMMAND]
[GENERIC_OPTIONS] [COMMAND_OPTIONS]>>>
All of the Hadoop commands and subprojects follow the same basic structure:
Hadoop has an option parsing framework that employs parsing generic
options as well as running classes.
Usage: <<<shellcommand [SHELL_OPTIONS] [COMMAND] [GENERIC_OPTIONS] [COMMAND_OPTIONS]>>>
*--------+---------+
|| FIELD || Description
*-----------------------+---------------+
|| COMMAND_OPTION || Description
| shellcommand | The command of the project being invoked. For example,
| Hadoop common uses <<<hadoop>>>, HDFS uses <<<hdfs>>>,
| and YARN uses <<<yarn>>>.
*---------------+-------------------+
| SHELL_OPTIONS | Options that the shell processes prior to executing Java.
*-----------------------+---------------+
| <<<--config confdir>>>| Overwrites the default Configuration directory. Default is <<<${HADOOP_HOME}/conf>>>.
| COMMAND | Action to perform.
*-----------------------+---------------+
| <<<--loglevel loglevel>>>| Overwrites the log level. Valid log levels are
| | FATAL, ERROR, WARN, INFO, DEBUG, and TRACE.
| | Default is INFO.
| GENERIC_OPTIONS | The common set of options supported by
| multiple commands.
*-----------------------+---------------+
| GENERIC_OPTIONS | The common set of options supported by multiple commands.
| COMMAND_OPTIONS | Various commands with their options are described in the following sections. The commands have been grouped into User Commands and Administration Commands.
| COMMAND_OPTIONS | Various commands with their options are
| described in this documentation for the
| Hadoop common sub-project. HDFS and YARN are
| covered in other documents.
*-----------------------+---------------+
Generic Options
** {Shell Options}
The following options are supported by {{dfsadmin}}, {{fs}}, {{fsck}},
{{job}} and {{fetchdt}}. Applications should implement
{{{../../api/org/apache/hadoop/util/Tool.html}Tool}} to support
GenericOptions.
All of the shell commands will accept a common set of options. For some commands,
these options are ignored. For example, passing <<<--hostnames>>> on a
command that only executes on a single host will be ignored.
*-----------------------+---------------+
|| SHELL_OPTION || Description
*-----------------------+---------------+
| <<<--buildpaths>>> | Enables developer versions of jars.
*-----------------------+---------------+
| <<<--config confdir>>> | Overwrites the default Configuration
| directory. Default is <<<${HADOOP_PREFIX}/conf>>>.
*-----------------------+----------------+
| <<<--daemon mode>>> | If the command supports daemonization (e.g.,
| <<<hdfs namenode>>>), execute in the appropriate
| mode. Supported modes are <<<start>>> to start the
| process in daemon mode, <<<stop>>> to stop the
| process, and <<<status>>> to determine the active
| status of the process. <<<status>>> will return
| an {{{http://refspecs.linuxbase.org/LSB_3.0.0/LSB-generic/LSB-generic/iniscrptact.html}LSB-compliant}} result code.
| If no option is provided, commands that support
| daemonization will run in the foreground.
*-----------------------+---------------+
| <<<--debug>>> | Enables shell level configuration debugging information
*-----------------------+---------------+
| <<<--help>>> | Shell script usage information.
*-----------------------+---------------+
| <<<--hostnames>>> | A space delimited list of hostnames where to execute
| a multi-host subcommand. By default, the content of
| the <<<slaves>>> file is used.
*-----------------------+----------------+
| <<<--hosts>>> | A file that contains a list of hostnames where to execute
| a multi-host subcommand. By default, the content of the
| <<<slaves>>> file is used.
*-----------------------+----------------+
| <<<--loglevel loglevel>>> | Overrides the log level. Valid log levels are
| | FATAL, ERROR, WARN, INFO, DEBUG, and TRACE.
| | Default is INFO.
*-----------------------+---------------+
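Because <<<--daemon status>>> is described as returning an LSB-compliant result code, a wrapper script can interpret the exit status. The sketch below maps the codes defined by the LSB init script specification to short descriptions; the function name is hypothetical, and which of these codes a given Hadoop command actually returns is not asserted here.

```shell
# Map an LSB "status" exit code to a short description.
describe_lsb_status() {
  case "$1" in
    0) echo "running" ;;
    1) echo "dead, but pid file exists" ;;
    2) echo "dead, but lock file exists" ;;
    3) echo "not running" ;;
    4) echo "status unknown" ;;
    *) echo "reserved code" ;;
  esac
}

# Intended use (requires a Hadoop installation, so shown as a comment):
#   hdfs --daemon status namenode
#   describe_lsb_status "$?"
describe_lsb_status 3
```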
** {Generic Options}
Many subcommands honor a common set of configuration options to alter their behavior:
*------------------------------------------------+-----------------------------+
|| GENERIC_OPTION || Description
*------------------------------------------------+-----------------------------+
|<<<-conf \<configuration file\> >>> | Specify an application
| configuration file.
*------------------------------------------------+-----------------------------+
|<<<-D \<property\>=\<value\> >>> | Use value for given property.
*------------------------------------------------+-----------------------------+
|<<<-jt \<local\> or \<resourcemanager:port\>>>> | Specify a ResourceManager.
| Applies only to job.
*------------------------------------------------+-----------------------------+
|<<<-files \<comma separated list of files\> >>> | Specify comma separated files
| to be copied to the map
| reduce cluster. Applies only
| to job.
*------------------------------------------------+-----------------------------+
|<<<-libjars \<comma separated list of jars\> >>>| Specify comma separated jar
| files to include in the
| classpath. Applies only to
| job.
*------------------------------------------------+-----------------------------+
|<<<-archives \<comma separated list of archives\> >>> | Specify comma separated
| archives to be unarchived on
| the compute machines. Applies
| only to job.
*------------------------------------------------+-----------------------------+
|<<<-conf \<configuration file\> >>> | Specify an application
| configuration file.
*------------------------------------------------+-----------------------------+
|<<<-D \<property\>=\<value\> >>> | Use value for given property.
*------------------------------------------------+-----------------------------+
|<<<-files \<comma separated list of files\> >>> | Specify comma separated files
| to be copied to the map
| reduce cluster. Applies only
| to job.
*------------------------------------------------+-----------------------------+
|<<<-jt \<local\> or \<resourcemanager:port\>>>> | Specify a ResourceManager.
| Applies only to job.
*------------------------------------------------+-----------------------------+
|<<<-libjars \<comma separated list of jars\> >>>| Specify comma separated jar
| files to include in the
| classpath. Applies only to
| job.
*------------------------------------------------+-----------------------------+
User Commands
Hadoop Common Commands
All of these commands are executed from the <<<hadoop>>> shell command. They
have been broken up into {{User Commands}} and
{{Administration Commands}}.
* User Commands
Commands useful for users of a hadoop cluster.
* <<<archive>>>
** <<<archive>>>
Creates a hadoop archive. More information can be found at
{{{../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html}
{{{../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html}
Hadoop Archives Guide}}.
* <<<credential>>>
** <<<checknative>>>
Command to manage credentials, passwords and secrets within credential providers.
Usage: <<<hadoop checknative [-a] [-h] >>>
The CredentialProvider API in Hadoop allows for the separation of applications
and how they store their required passwords/secrets. In order to indicate
a particular provider type and location, the user must provide the
<hadoop.security.credential.provider.path> configuration element in core-site.xml
or use the command line option <<<-provider>>> on each of the following commands.
This provider path is a comma-separated list of URLs that indicates the type and
location of a list of providers that should be consulted.
For example, the following path:
*-----------------+-----------------------------------------------------------+
|| COMMAND_OPTION || Description
*-----------------+-----------------------------------------------------------+
| -a | Check all libraries are available.
*-----------------+-----------------------------------------------------------+
| -h | print help
*-----------------+-----------------------------------------------------------+
<<<user:///,jceks://file/tmp/test.jceks,jceks://hdfs@nn1.example.com/my/path/test.jceks>>>
This command checks the availability of the Hadoop native code. See
{{{NativeLibraries.html}}} for more information. By default, this command
only checks the availability of libhadoop.
indicates that the current user's credentials file should be consulted through
the User Provider, that the local file located at <<</tmp/test.jceks>>> is a Java Keystore
Provider and that the file located within HDFS at <<<nn1.example.com/my/path/test.jceks>>>
is also a store for a Java Keystore Provider.
** <<<classpath>>>
When utilizing the credential command it will often be for provisioning a password
or secret to a particular credential store provider. In order to explicitly
indicate which provider store to use the <<<-provider>>> option should be used. Otherwise,
given a path of multiple providers, the first non-transient provider will be used.
This may or may not be the one that you intended.
Usage: <<<hadoop classpath [--glob|--jar <path>|-h|--help]>>>
Example: <<<-provider jceks://file/tmp/test.jceks>>>
*-----------------+-----------------------------------------------------------+
|| COMMAND_OPTION || Description
*-----------------+-----------------------------------------------------------+
| --glob | expand wildcards
*-----------------+-----------------------------------------------------------+
| --jar <path> | write classpath as manifest in jar named <path>
*-----------------+-----------------------------------------------------------+
| -h, --help | print help
*-----------------+-----------------------------------------------------------+
Prints the class path needed to get the Hadoop jar and the required
libraries. If called without arguments, then prints the classpath set up by
the command scripts, which is likely to contain wildcards in the classpath
entries. Additional options print the classpath after wildcard expansion or
write the classpath into the manifest of a jar file. The latter is useful in
environments where wildcards cannot be used and the expanded classpath exceeds
the maximum supported command line length.
** <<<credential>>>
Usage: <<<hadoop credential <subcommand> [options]>>>
@ -143,109 +202,96 @@ User Commands
| indicated.
*-------------------+-------------------------------------------------------+
* <<<distcp>>>
Command to manage credentials, passwords and secrets within credential providers.
The CredentialProvider API in Hadoop allows for the separation of applications
and how they store their required passwords/secrets. In order to indicate
a particular provider type and location, the user must provide the
<hadoop.security.credential.provider.path> configuration element in core-site.xml
or use the command line option <<<-provider>>> on each of the following commands.
This provider path is a comma-separated list of URLs that indicates the type and
location of a list of providers that should be consulted. For example, the following path:
<<<user:///,jceks://file/tmp/test.jceks,jceks://hdfs@nn1.example.com/my/path/test.jceks>>>
indicates that the current user's credentials file should be consulted through
the User Provider, that the local file located at <<</tmp/test.jceks>>> is a Java Keystore
Provider and that the file located within HDFS at <<<nn1.example.com/my/path/test.jceks>>>
is also a store for a Java Keystore Provider.
The credential command is most often used to provision a password or secret to a
particular credential store provider. To indicate explicitly which provider store
to use, pass the <<<-provider>>> option. Otherwise, given a path of multiple
providers, the first non-transient provider is used, which may or may not be the
one you intended.
Example: <<<-provider jceks://file/tmp/test.jceks>>>
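Because the provider path is a plain comma-separated list, it can be inspected with ordinary shell tools before it is placed in core-site.xml. The sketch below reuses the example path from this section; <<<list_providers>>> is a hypothetical helper and not part of Hadoop.

```shell
# Provider path copied from the example in this section.
provider_path='user:///,jceks://file/tmp/test.jceks,jceks://hdfs@nn1.example.com/my/path/test.jceks'

# list_providers is a hypothetical helper, not a Hadoop command.
list_providers() {
  # Split the list on commas and print one provider URL per line.
  printf '%s\n' "$1" | tr ',' '\n'
}

list_providers "$provider_path"
```

Listing the entries one per line makes the order in which the providers are consulted easy to review.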
** <<<distch>>>
Usage: <<<hadoop distch [-f urilist_url] [-i] [-log logdir] path:owner:group:permissions>>>
*-------------------+-------------------------------------------------------+
||COMMAND_OPTION || Description
*-------------------+-------------------------------------------------------+
| -f | List of objects to change
*----+------------+
| -i | Ignore failures
*----+------------+
| -log | Directory to log output
*-----+---------+
Change the ownership and permissions on many files at once.
** <<<distcp>>>
Copy files or directories recursively. More information can be found at
{{{../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistCp.html}
Hadoop DistCp Guide}}.
* <<<fs>>>
** <<<fs>>>
Deprecated, use {{{../hadoop-hdfs/HDFSCommands.html#dfs}<<<hdfs dfs>>>}}
instead.
This command is documented in the {{{./FileSystemShell.html}File System Shell Guide}}. It is a synonym for <<<hdfs dfs>>> when HDFS is in use.
* <<<fsck>>>
** <<<jar>>>
Deprecated, use {{{../hadoop-hdfs/HDFSCommands.html#fsck}<<<hdfs fsck>>>}}
instead.
Usage: <<<hadoop jar <jar> [mainClass] args...>>>
* <<<fetchdt>>>
Runs a jar file.
Deprecated, use {{{../hadoop-hdfs/HDFSCommands.html#fetchdt}
<<<hdfs fetchdt>>>}} instead.
Use {{{../../hadoop-yarn/hadoop-yarn-site/YarnCommands.html#jar}<<<yarn jar>>>}}
to launch YARN applications instead.
* <<<jar>>>
** <<<jnipath>>>
Runs a jar file. Users can bundle their Map Reduce code in a jar file and
execute it using this command.
Usage: <<<hadoop jnipath>>>
Usage: <<<hadoop jar <jar> [mainClass] args...>>>
Print the computed java.library.path.
The streaming jobs are run via this command. Examples can be referred from
Streaming examples
** <<<key>>>
Word count example is also run using jar command. It can be referred from
Wordcount example
Manage keys via the KeyProvider.
Use {{{../../hadoop-yarn/hadoop-yarn-site/YarnCommands.html#jar}<<<yarn jar>>>}}
to launch YARN applications instead.
** <<<trace>>>
* <<<job>>>
View and modify Hadoop tracing settings. See the {{{./Tracing.html}Tracing Guide}}.
Deprecated. Use
{{{../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredCommands.html#job}
<<<mapred job>>>}} instead.
* <<<pipes>>>
Deprecated. Use
{{{../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredCommands.html#pipes}
<<<mapred pipes>>>}} instead.
* <<<queue>>>
Deprecated. Use
{{{../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapredCommands.html#queue}
<<<mapred queue>>>}} instead.
* <<<version>>>
Prints the version.
** <<<version>>>
Usage: <<<hadoop version>>>
* <<<CLASSNAME>>>
Prints the version.
hadoop script can be used to invoke any class.
** <<<CLASSNAME>>>
Usage: <<<hadoop CLASSNAME>>>
Runs the class named <<<CLASSNAME>>>.
Runs the class named <<<CLASSNAME>>>. The class must be part of a package.
* <<<classpath>>>
Prints the class path needed to get the Hadoop jar and the required
libraries. If called without arguments, then prints the classpath set up by
the command scripts, which is likely to contain wildcards in the classpath
entries. Additional options print the classpath after wildcard expansion or
write the classpath into the manifest of a jar file. The latter is useful in
environments where wildcards cannot be used and the expanded classpath exceeds
the maximum supported command line length.
Usage: <<<hadoop classpath [--glob|--jar <path>|-h|--help]>>>
*-----------------+-----------------------------------------------------------+
|| COMMAND_OPTION || Description
*-----------------+-----------------------------------------------------------+
| --glob | expand wildcards
*-----------------+-----------------------------------------------------------+
| --jar <path> | write classpath as manifest in jar named <path>
*-----------------+-----------------------------------------------------------+
| -h, --help | print help
*-----------------+-----------------------------------------------------------+
Administration Commands
* {Administration Commands}
Commands useful for administrators of a hadoop cluster.
* <<<balancer>>>
Deprecated, use {{{../hadoop-hdfs/HDFSCommands.html#balancer}
<<<hdfs balancer>>>}} instead.
* <<<daemonlog>>>
Get/Set the log level for each daemon.
** <<<daemonlog>>>
Usage: <<<hadoop daemonlog -getlevel <host:port> <name> >>>
Usage: <<<hadoop daemonlog -setlevel <host:port> <name> <level> >>>
@ -262,22 +308,20 @@ Administration Commands
| connects to http://<host:port>/logLevel?log=<name>
*------------------------------+-----------------------------------------------------------+
* <<<datanode>>>
Get/Set the log level for each daemon.
Deprecated, use {{{../hadoop-hdfs/HDFSCommands.html#datanode}
<<<hdfs datanode>>>}} instead.
* Files
* <<<dfsadmin>>>
** <<etc/hadoop/hadoop-env.sh>>
Deprecated, use {{{../hadoop-hdfs/HDFSCommands.html#dfsadmin}
<<<hdfs dfsadmin>>>}} instead.
This file stores the global settings used by all Hadoop shell commands.
* <<<namenode>>>
** <<etc/hadoop/hadoop-user-functions.sh>>
Deprecated, use {{{../hadoop-hdfs/HDFSCommands.html#namenode}
<<<hdfs namenode>>>}} instead.
This file allows advanced users to override some shell functionality.
* <<<secondarynamenode>>>
** <<~/.hadooprc>>
Deprecated, use {{{../hadoop-hdfs/HDFSCommands.html#secondarynamenode}
<<<hdfs secondarynamenode>>>}} instead.
This file stores the personal environment for an individual user. It is
processed after the hadoop-env.sh and hadoop-user-functions.sh files
and can contain the same settings.
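The practical effect of this ordering is that a value set in the personal file replaces a value set earlier. The sketch below is a simulation only: temporary files stand in for the real configuration files, and <<<DEMO_SETTING>>> is a made-up variable rather than a real Hadoop setting.

```shell
# Simulation only: temporary files stand in for the real configuration files.
demo_dir=$(mktemp -d)
printf 'DEMO_SETTING=site-wide\n' > "$demo_dir/hadoop-env.sh"
printf 'DEMO_SETTING=personal\n' > "$demo_dir/hadooprc"

# Source the files in the order described above: global settings first,
# the personal file last.
. "$demo_dir/hadoop-env.sh"
. "$demo_dir/hadooprc"

# The value from the file read last is the one that remains.
echo "DEMO_SETTING=$DEMO_SETTING"
rm -rf "$demo_dir"
```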


@ -45,46 +45,62 @@ bin/hadoop fs <args>
Differences are described with each of the commands. Error information is
sent to stderr and the output is sent to stdout.
appendToFile
If HDFS is being used, <<<hdfs dfs>>> is a synonym.
Usage: <<<hdfs dfs -appendToFile <localsrc> ... <dst> >>>
See the {{{./CommandsManual.html}Commands Manual}} for generic shell options.
* appendToFile
Usage: <<<hadoop fs -appendToFile <localsrc> ... <dst> >>>
Append single src, or multiple srcs from local file system to the
destination file system. Also reads input from stdin and appends to
destination file system.
* <<<hdfs dfs -appendToFile localfile /user/hadoop/hadoopfile>>>
* <<<hadoop fs -appendToFile localfile /user/hadoop/hadoopfile>>>
* <<<hdfs dfs -appendToFile localfile1 localfile2 /user/hadoop/hadoopfile>>>
* <<<hadoop fs -appendToFile localfile1 localfile2 /user/hadoop/hadoopfile>>>
* <<<hdfs dfs -appendToFile localfile hdfs://nn.example.com/hadoop/hadoopfile>>>
* <<<hadoop fs -appendToFile localfile hdfs://nn.example.com/hadoop/hadoopfile>>>
* <<<hdfs dfs -appendToFile - hdfs://nn.example.com/hadoop/hadoopfile>>>
* <<<hadoop fs -appendToFile - hdfs://nn.example.com/hadoop/hadoopfile>>>
Reads the input from stdin.
Exit Code:
Returns 0 on success and 1 on error.
cat
* cat
Usage: <<<hdfs dfs -cat URI [URI ...]>>>
Usage: <<<hadoop fs -cat URI [URI ...]>>>
Copies source paths to stdout.
Example:
* <<<hdfs dfs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2>>>
* <<<hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2>>>
* <<<hdfs dfs -cat file:///file3 /user/hadoop/file4>>>
* <<<hadoop fs -cat file:///file3 /user/hadoop/file4>>>
Exit Code:
Returns 0 on success and -1 on error.
chgrp
* checksum
Usage: <<<hdfs dfs -chgrp [-R] GROUP URI [URI ...]>>>
Usage: <<<hadoop fs -checksum URI>>>
Returns the checksum information of a file.
Example:
* <<<hadoop fs -checksum hdfs://nn1.example.com/file1>>>
* <<<hadoop fs -checksum file:///etc/hosts>>>
* chgrp
Usage: <<<hadoop fs -chgrp [-R] GROUP URI [URI ...]>>>
Change group association of files. The user must be the owner of files, or
else a super-user. Additional information is in the
@ -94,9 +110,9 @@ chgrp
* The -R option will make the change recursively through the directory structure.
chmod
* chmod
Usage: <<<hdfs dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]>>>
Usage: <<<hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]>>>
Change the permissions of files. With -R, make the change recursively
through the directory structure. The user must be the owner of the file, or
@ -107,9 +123,9 @@ chmod
* The -R option will make the change recursively through the directory structure.
chown
* chown
Usage: <<<hdfs dfs -chown [-R] [OWNER][:[GROUP]] URI [URI ]>>>
Usage: <<<hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]>>>
Change the owner of files. The user must be a super-user. Additional information
is in the {{{../hadoop-hdfs/HdfsPermissionsGuide.html}Permissions Guide}}.
@ -118,9 +134,9 @@ chown
* The -R option will make the change recursively through the directory structure.
copyFromLocal
* copyFromLocal
Usage: <<<hdfs dfs -copyFromLocal <localsrc> URI>>>
Usage: <<<hadoop fs -copyFromLocal <localsrc> URI>>>
Similar to put command, except that the source is restricted to a local
file reference.
@ -129,16 +145,16 @@ copyFromLocal
* The -f option will overwrite the destination if it already exists.
copyToLocal
* copyToLocal
Usage: <<<hdfs dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst> >>>
Usage: <<<hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst> >>>
Similar to get command, except that the destination is restricted to a
local file reference.
count
* count
Usage: <<<hdfs dfs -count [-q] [-h] <paths> >>>
Usage: <<<hadoop fs -count [-q] [-h] <paths> >>>
Count the number of directories, files and bytes under the paths that match
the specified file pattern. The output columns with -count are: DIR_COUNT,
@ -151,19 +167,19 @@ count
Example:
* <<<hdfs dfs -count hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2>>>
* <<<hadoop fs -count hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2>>>
* <<<hdfs dfs -count -q hdfs://nn1.example.com/file1>>>
* <<<hadoop fs -count -q hdfs://nn1.example.com/file1>>>
* <<<hdfs dfs -count -q -h hdfs://nn1.example.com/file1>>>
* <<<hadoop fs -count -q -h hdfs://nn1.example.com/file1>>>
Exit Code:
Returns 0 on success and -1 on error.
cp
* cp
Usage: <<<hdfs dfs -cp [-f] [-p | -p[topax]] URI [URI ...] <dest> >>>
Usage: <<<hadoop fs -cp [-f] [-p | -p[topax]] URI [URI ...] <dest> >>>
Copy files from source to destination. This command allows multiple sources
as well in which case the destination must be a directory.
@ -187,17 +203,41 @@ cp
Example:
* <<<hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2>>>
* <<<hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2>>>
* <<<hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir>>>
* <<<hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir>>>
Exit Code:
Returns 0 on success and -1 on error.
du
* createSnapshot
Usage: <<<hdfs dfs -du [-s] [-h] URI [URI ...]>>>
See {{{../hadoop-hdfs/HdfsSnapshots.html}HDFS Snapshots Guide}}.
* deleteSnapshot
See {{{../hadoop-hdfs/HdfsSnapshots.html}HDFS Snapshots Guide}}.
* df
Usage: <<<hadoop fs -df [-h] URI [URI ...]>>>
Displays free space.
Options:
* The -h option will format file sizes in a "human-readable" fashion (e.g.,
64.0m instead of 67108864).
Example:
* <<<hadoop fs -df /user/hadoop/dir1>>>
* du
Usage: <<<hadoop fs -du [-s] [-h] URI [URI ...]>>>
Displays sizes of files and directories contained in the given directory or
the length of a file in case it's just a file.
@ -212,29 +252,29 @@ du
Example:
* hdfs dfs -du /user/hadoop/dir1 /user/hadoop/file1 hdfs://nn.example.com/user/hadoop/dir1
* <<<hadoop fs -du /user/hadoop/dir1 /user/hadoop/file1 hdfs://nn.example.com/user/hadoop/dir1>>>
Exit Code:
Returns 0 on success and -1 on error.
dus
* dus
Usage: <<<hdfs dfs -dus <args> >>>
Usage: <<<hadoop fs -dus <args> >>>
Displays a summary of file lengths.
<<Note:>> This command is deprecated. Instead use <<<hdfs dfs -du -s>>>.
<<Note:>> This command is deprecated. Instead use <<<hadoop fs -du -s>>>.
expunge
* expunge
Usage: <<<hdfs dfs -expunge>>>
Usage: <<<hadoop fs -expunge>>>
Empty the Trash. Refer to the {{{../hadoop-hdfs/HdfsDesign.html}
HDFS Architecture Guide}} for more information on the Trash feature.
find
* find
Usage: <<<hdfs dfs -find <path> ... <expression> ... >>>
Usage: <<<hadoop fs -find <path> ... <expression> ... >>>
Finds all files that match the specified expression and applies selected
actions to them. If no <path> is specified then defaults to the current
@ -269,15 +309,15 @@ find
Example:
<<<hdfs dfs -find / -name test -print>>>
<<<hadoop fs -find / -name test -print>>>
Exit Code:
Returns 0 on success and -1 on error.
get
* get
Usage: <<<hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst> >>>
Usage: <<<hadoop fs -get [-ignorecrc] [-crc] <src> <localdst> >>>
Copy files to the local file system. Files that fail the CRC check may be
copied with the -ignorecrc option. Files and CRCs may be copied using the
@ -285,17 +325,17 @@ get
Example:
* <<<hdfs dfs -get /user/hadoop/file localfile>>>
* <<<hadoop fs -get /user/hadoop/file localfile>>>
* <<<hdfs dfs -get hdfs://nn.example.com/user/hadoop/file localfile>>>
* <<<hadoop fs -get hdfs://nn.example.com/user/hadoop/file localfile>>>
Exit Code:
Returns 0 on success and -1 on error.
getfacl
* getfacl
Usage: <<<hdfs dfs -getfacl [-R] <path> >>>
Usage: <<<hadoop fs -getfacl [-R] <path> >>>
Displays the Access Control Lists (ACLs) of files and directories. If a
directory has a default ACL, then getfacl also displays the default ACL.
@ -308,17 +348,17 @@ getfacl
Examples:
* <<<hdfs dfs -getfacl /file>>>
* <<<hadoop fs -getfacl /file>>>
* <<<hdfs dfs -getfacl -R /dir>>>
* <<<hadoop fs -getfacl -R /dir>>>
Exit Code:
Returns 0 on success and non-zero on error.
getfattr
* getfattr
Usage: <<<hdfs dfs -getfattr [-R] {-n name | -d} [-e en] <path> >>>
Usage: <<<hadoop fs -getfattr [-R] {-n name | -d} [-e en] <path> >>>
Displays the extended attribute names and values (if any) for a file or
directory.
@ -337,26 +377,32 @@ getfattr
Examples:
* <<<hdfs dfs -getfattr -d /file>>>
* <<<hadoop fs -getfattr -d /file>>>
* <<<hdfs dfs -getfattr -R -n user.myAttr /dir>>>
* <<<hadoop fs -getfattr -R -n user.myAttr /dir>>>
Exit Code:
Returns 0 on success and non-zero on error.
getmerge
* getmerge
Usage: <<<hdfs dfs -getmerge <src> <localdst> [addnl]>>>
Usage: <<<hadoop fs -getmerge <src> <localdst> [addnl]>>>
Takes a source directory and a destination file as input and concatenates
files in src into the destination local file. Optionally addnl can be set to
enable adding a newline character at the
end of each file.
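The effect of getmerge, including the optional newline, can be pictured with a local-only imitation. This is a sketch under stated assumptions: <<<local_getmerge>>> is a hypothetical helper that works on ordinary local directories, merges files in the shell's glob order, and is not how Hadoop implements the command.

```shell
# local_getmerge SRC_DIR DEST [addnl]
# Hypothetical local-only imitation of getmerge; not a Hadoop command.
local_getmerge() {
  src=$1
  dest=$2
  addnl=${3:-}
  # Start with an empty destination file.
  : > "$dest"
  for f in "$src"/*; do
    # Skip anything that is not a regular file.
    [ -f "$f" ] || continue
    cat "$f" >> "$dest"
    # When a third argument is given, add a newline after each file.
    if [ -n "$addnl" ]; then
      printf '\n' >> "$dest"
    fi
  done
}
```

Without the third argument the file contents are joined back to back, so the newline option matters when the source files do not already end with a newline.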
ls
* help
Usage: <<<hdfs dfs -ls [-R] <args> >>>
Usage: <<<hadoop fs -help>>>
Return usage output.
* ls
Usage: <<<hadoop fs -ls [-R] <args> >>>
Options:
@ -377,23 +423,23 @@ permissions userid groupid modification_date modification_time dirname
Example:
* <<<hdfs dfs -ls /user/hadoop/file1>>>
* <<<hadoop fs -ls /user/hadoop/file1>>>
Exit Code:
Returns 0 on success and -1 on error.
lsr
* lsr
Usage: <<<hdfs dfs -lsr <args> >>>
Usage: <<<hadoop fs -lsr <args> >>>
Recursive version of ls.
<<Note:>> This command is deprecated. Instead use <<<hdfs dfs -ls -R>>>
<<Note:>> This command is deprecated. Instead use <<<hadoop fs -ls -R>>>
mkdir
* mkdir
Usage: <<<hdfs dfs -mkdir [-p] <paths> >>>
Usage: <<<hadoop fs -mkdir [-p] <paths> >>>
Takes path URIs as arguments and creates directories.
@ -403,30 +449,30 @@ mkdir
Example:
* <<<hdfs dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2>>>
* <<<hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2>>>
* <<<hdfs dfs -mkdir hdfs://nn1.example.com/user/hadoop/dir hdfs://nn2.example.com/user/hadoop/dir>>>
* <<<hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir hdfs://nn2.example.com/user/hadoop/dir>>>
Exit Code:
Returns 0 on success and -1 on error.
moveFromLocal
* moveFromLocal
Usage: <<<hdfs dfs -moveFromLocal <localsrc> <dst> >>>
Usage: <<<hadoop fs -moveFromLocal <localsrc> <dst> >>>
Similar to put command, except that the source localsrc is deleted after
it's copied.
moveToLocal
* moveToLocal
Usage: <<<hdfs dfs -moveToLocal [-crc] <src> <dst> >>>
Usage: <<<hadoop fs -moveToLocal [-crc] <src> <dst> >>>
Displays a "Not implemented yet" message.
mv
* mv
Usage: <<<hdfs dfs -mv URI [URI ...] <dest> >>>
Usage: <<<hadoop fs -mv URI [URI ...] <dest> >>>
Moves files from source to destination. This command allows multiple sources
as well in which case the destination needs to be a directory. Moving files
@ -434,38 +480,42 @@ mv
Example:
* <<<hdfs dfs -mv /user/hadoop/file1 /user/hadoop/file2>>>
* <<<hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2>>>
* <<<hdfs dfs -mv hdfs://nn.example.com/file1 hdfs://nn.example.com/file2 hdfs://nn.example.com/file3 hdfs://nn.example.com/dir1>>>
* <<<hadoop fs -mv hdfs://nn.example.com/file1 hdfs://nn.example.com/file2 hdfs://nn.example.com/file3 hdfs://nn.example.com/dir1>>>
Exit Code:
Returns 0 on success and -1 on error.
put
* put
Usage: <<<hdfs dfs -put <localsrc> ... <dst> >>>
Usage: <<<hadoop fs -put <localsrc> ... <dst> >>>
Copy single src, or multiple srcs from local file system to the destination
file system. Also reads input from stdin and writes to destination file
system.
* <<<hdfs dfs -put localfile /user/hadoop/hadoopfile>>>
* <<<hadoop fs -put localfile /user/hadoop/hadoopfile>>>
* <<<hdfs dfs -put localfile1 localfile2 /user/hadoop/hadoopdir>>>
* <<<hadoop fs -put localfile1 localfile2 /user/hadoop/hadoopdir>>>
* <<<hdfs dfs -put localfile hdfs://nn.example.com/hadoop/hadoopfile>>>
* <<<hadoop fs -put localfile hdfs://nn.example.com/hadoop/hadoopfile>>>
* <<<hdfs dfs -put - hdfs://nn.example.com/hadoop/hadoopfile>>>
* <<<hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile>>>
Reads the input from stdin.
Exit Code:
Returns 0 on success and -1 on error.
rm
* renameSnapshot
Usage: <<<hdfs dfs -rm [-f] [-r|-R] [-skipTrash] URI [URI ...]>>>
See {{{../hadoop-hdfs/HdfsSnapshots.html}HDFS Snapshots Guide}}.
* rm
Usage: <<<hadoop fs -rm [-f] [-r|-R] [-skipTrash] URI [URI ...]>>>
Delete files specified as args.
@ -484,23 +534,37 @@ rm
Example:
* <<<hdfs dfs -rm hdfs://nn.example.com/file /user/hadoop/emptydir>>>
* <<<hadoop fs -rm hdfs://nn.example.com/file /user/hadoop/emptydir>>>
Exit Code:
Returns 0 on success and -1 on error.
rmr
* rmdir
Usage: <<<hdfs dfs -rmr [-skipTrash] URI [URI ...]>>>
Usage: <<<hadoop fs -rmdir [--ignore-fail-on-non-empty] URI [URI ...]>>>
Delete a directory.
Options:
* --ignore-fail-on-non-empty: When using wildcards, do not fail if a directory still contains files.
Example:
* <<<hadoop fs -rmdir /user/hadoop/emptydir>>>
* rmr
Usage: <<<hadoop fs -rmr [-skipTrash] URI [URI ...]>>>
Recursive version of delete.
<<Note:>> This command is deprecated. Instead use <<<hdfs dfs -rm -r>>>
<<Note:>> This command is deprecated. Instead use <<<hadoop fs -rm -r>>>
setfacl
* setfacl
Usage: <<<hdfs dfs -setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>] >>>
Usage: <<<hadoop fs -setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>] >>>
Sets Access Control Lists (ACLs) of files and directories.
@ -528,27 +592,27 @@ setfacl
Examples:
* <<<hdfs dfs -setfacl -m user:hadoop:rw- /file>>>
* <<<hadoop fs -setfacl -m user:hadoop:rw- /file>>>
* <<<hdfs dfs -setfacl -x user:hadoop /file>>>
* <<<hadoop fs -setfacl -x user:hadoop /file>>>
* <<<hdfs dfs -setfacl -b /file>>>
* <<<hadoop fs -setfacl -b /file>>>
* <<<hdfs dfs -setfacl -k /dir>>>
* <<<hadoop fs -setfacl -k /dir>>>
* <<<hdfs dfs -setfacl --set user::rw-,user:hadoop:rw-,group::r--,other::r-- /file>>>
* <<<hadoop fs -setfacl --set user::rw-,user:hadoop:rw-,group::r--,other::r-- /file>>>
* <<<hdfs dfs -setfacl -R -m user:hadoop:r-x /dir>>>
* <<<hadoop fs -setfacl -R -m user:hadoop:r-x /dir>>>
* <<<hdfs dfs -setfacl -m default:user:hadoop:r-x /dir>>>
* <<<hadoop fs -setfacl -m default:user:hadoop:r-x /dir>>>
Exit Code:
Returns 0 on success and non-zero on error.
setfattr
* setfattr
Usage: <<<hdfs dfs -setfattr {-n name [-v value] | -x name} <path> >>>
Usage: <<<hadoop fs -setfattr {-n name [-v value] | -x name} <path> >>>
Sets an extended attribute name and value for a file or directory.
@ -566,19 +630,19 @@ setfattr
Examples:
* <<<hdfs dfs -setfattr -n user.myAttr -v myValue /file>>>
* <<<hadoop fs -setfattr -n user.myAttr -v myValue /file>>>
* <<<hdfs dfs -setfattr -n user.noValue /file>>>
* <<<hadoop fs -setfattr -n user.noValue /file>>>
* <<<hdfs dfs -setfattr -x user.myAttr /file>>>
* <<<hadoop fs -setfattr -x user.myAttr /file>>>
Exit Code:
Returns 0 on success and non-zero on error.
setrep
* setrep
Usage: <<<hdfs dfs -setrep [-R] [-w] <numReplicas> <path> >>>
Usage: <<<hadoop fs -setrep [-R] [-w] <numReplicas> <path> >>>
Changes the replication factor of a file. If <path> is a directory then
the command recursively changes the replication factor of all files under
@ -593,28 +657,28 @@ setrep
Example:
* <<<hdfs dfs -setrep -w 3 /user/hadoop/dir1>>>
* <<<hadoop fs -setrep -w 3 /user/hadoop/dir1>>>
Exit Code:
Returns 0 on success and -1 on error.
stat
* stat
Usage: <<<hdfs dfs -stat URI [URI ...]>>>
Usage: <<<hadoop fs -stat URI [URI ...]>>>
Returns the stat information on the path.
Example:
* <<<hdfs dfs -stat path>>>
* <<<hadoop fs -stat path>>>
Exit Code:
Returns 0 on success and -1 on error.
tail
* tail
Usage: <<<hdfs dfs -tail [-f] URI>>>
Usage: <<<hadoop fs -tail [-f] URI>>>
Displays last kilobyte of the file to stdout.
@ -624,43 +688,54 @@ tail
Example:
* <<<hdfs dfs -tail pathname>>>
* <<<hadoop fs -tail pathname>>>
Exit Code:
Returns 0 on success and -1 on error.
test
* test
Usage: <<<hdfs dfs -test -[ezd] URI>>>
Usage: <<<hadoop fs -test -[defsz] URI>>>
Options:
* The -e option will check to see if the file exists, returning 0 if true.
* -d: if the path is a directory, return 0.
* The -z option will check to see if the file is zero length, returning 0 if true.
* -e: if the path exists, return 0.
* The -d option will check to see if the path is directory, returning 0 if true.
* -f: if the path is a file, return 0.
* -s: if the path is not empty, return 0.
* -z: if the file is zero length, return 0.
Example:
* <<<hdfs dfs -test -e filename>>>
* <<<hadoop fs -test -e filename>>>
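Since each option reports its result through the exit status, the command fits directly into shell conditionals. The sketch below is one possible way to combine the documented <<<-d>>>, <<<-f>>> and <<<-e>>> options; <<<path_kind>>> is a hypothetical helper, and it assumes a <<<hadoop>>> command on the PATH.

```shell
# path_kind is a hypothetical helper built only on the documented -test options.
# It assumes a working "hadoop" command is available on the PATH.
path_kind() {
  if hadoop fs -test -d "$1"; then
    echo directory
  elif hadoop fs -test -f "$1"; then
    echo file
  elif hadoop fs -test -e "$1"; then
    # Exists, but was reported as neither a directory nor a file.
    echo other
  else
    echo missing
  fi
}
```

A call such as <<<path_kind /user/hadoop/dir1>>> then prints a single word that a script can branch on, instead of checking each exit status by hand.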
text
* text
Usage: <<<hdfs dfs -text <src> >>>
Usage: <<<hadoop fs -text <src> >>>
Takes a source file and outputs the file in text format. The allowed formats
are zip and TextRecordInputStream.
touchz
* touchz
Usage: <<<hdfs dfs -touchz URI [URI ...]>>>
Usage: <<<hadoop fs -touchz URI [URI ...]>>>
Create a file of zero length.
Example:
* <<<hdfs dfs -touchz pathname>>>
* <<<hadoop fs -touchz pathname>>>
Exit Code:
Returns 0 on success and -1 on error.
* usage
Usage: <<<hadoop fs -usage command>>>
Return the help for an individual command.


@ -11,12 +11,12 @@
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop MapReduce Next Generation ${project.version} - Setting up a Single Node Cluster.
Hadoop ${project.version} - Setting up a Single Node Cluster.
---
---
${maven.build.timestamp}
Hadoop MapReduce Next Generation - Setting up a Single Node Cluster.
Hadoop - Setting up a Single Node Cluster.
%{toc|section=1|fromDepth=0}
@ -46,7 +46,9 @@ Hadoop MapReduce Next Generation - Setting up a Single Node Cluster.
HadoopJavaVersions}}.
[[2]] ssh must be installed and sshd must be running to use the Hadoop
scripts that manage remote Hadoop daemons.
scripts that manage remote Hadoop daemons if the optional start
and stop scripts are to be used. Additionally, it is recommended that
pdsh also be installed for better ssh resource management.
** Installing Software
@ -57,7 +59,7 @@ Hadoop MapReduce Next Generation - Setting up a Single Node Cluster.
----
$ sudo apt-get install ssh
$ sudo apt-get install rsync
$ sudo apt-get install pdsh
----
* Download
@ -75,9 +77,6 @@ Hadoop MapReduce Next Generation - Setting up a Single Node Cluster.
----
# set to the root of your Java installation
export JAVA_HOME=/usr/java/latest
# Assuming your installation directory is /usr/local/hadoop
export HADOOP_PREFIX=/usr/local/hadoop
----
Try the following command:
@ -158,6 +157,7 @@ Hadoop MapReduce Next Generation - Setting up a Single Node Cluster.
----
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
----
** Execution
@ -228,7 +228,7 @@ Hadoop MapReduce Next Generation - Setting up a Single Node Cluster.
$ sbin/stop-dfs.sh
----
** YARN on Single Node
** YARN on a Single Node
You can run a MapReduce job on YARN in a pseudo-distributed mode by setting
a few parameters and running ResourceManager daemon and NodeManager daemon
@ -239,7 +239,7 @@ Hadoop MapReduce Next Generation - Setting up a Single Node Cluster.
[[1]] Configure parameters as follows:
etc/hadoop/mapred-site.xml:
<<<etc/hadoop/mapred-site.xml>>>:
+---+
<configuration>
@ -250,7 +250,7 @@ Hadoop MapReduce Next Generation - Setting up a Single Node Cluster.
</configuration>
+---+
etc/hadoop/yarn-site.xml:
<<<etc/hadoop/yarn-site.xml>>>:
+---+
<configuration>