diff --git a/hadoop-common-project/hadoop-common/CHANGES.txt b/hadoop-common-project/hadoop-common/CHANGES.txt
index c7b2d0c5f2a..472c5b3cb26 100644
--- a/hadoop-common-project/hadoop-common/CHANGES.txt
+++ b/hadoop-common-project/hadoop-common/CHANGES.txt
@@ -8,6 +8,9 @@ Release 2.4.0 - UNRELEASED
 
   IMPROVEMENTS
 
+    HADOOP-10139. Update and improve the Single Cluster Setup document.
+    (Akira Ajisaka via Arpit Agarwal)
+
   OPTIMIZATIONS
 
   BUG FIXES
diff --git a/hadoop-common-project/hadoop-common/src/site/apt/SingleCluster.apt.vm b/hadoop-common-project/hadoop-common/src/site/apt/SingleCluster.apt.vm
index cf8390e4610..ef7532a9e52 100644
--- a/hadoop-common-project/hadoop-common/src/site/apt/SingleCluster.apt.vm
+++ b/hadoop-common-project/hadoop-common/src/site/apt/SingleCluster.apt.vm
@@ -20,174 +20,267 @@ Hadoop MapReduce Next Generation - Setting up a Single Node Cluster.
 
 %{toc|section=1|fromDepth=0}
 
-* Mapreduce Tarball
+* Purpose
 
-  You should be able to obtain the MapReduce tarball from the release.
-  If not, you should be able to create a tarball from the source.
+  This document describes how to set up and configure a single-node Hadoop
+  installation so that you can quickly perform simple operations using Hadoop
+  MapReduce and the Hadoop Distributed File System (HDFS).
+
+* Prerequisites
+
+** Supported Platforms
+
+  * GNU/Linux is supported as a development and production platform.
+    Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
+
+  * Windows is also a supported platform, but the following steps
+    are for Linux only. To set up Hadoop on Windows, see the
+    {{{http://wiki.apache.org/hadoop/Hadoop2OnWindows}wiki page}}.
+
+** Required Software
+
+  Required software for Linux includes:
+
+  [[1]] Java™ must be installed. Recommended Java versions are described
+        at {{{http://wiki.apache.org/hadoop/HadoopJavaVersions}
+        HadoopJavaVersions}}.
+
+  [[2]] ssh must be installed and sshd must be running to use the Hadoop
+        scripts that manage remote Hadoop daemons.
+
+** Installing Software
+
+  If your cluster doesn't have the requisite software, you will need to
+  install it.
+
+  For example, on Ubuntu Linux:
+
+----
+  $ sudo apt-get install ssh
+  $ sudo apt-get install rsync
+----
+
+* Download
+
+  To get a Hadoop distribution, download a recent stable release from one of
+  the {{{http://www.apache.org/dyn/closer.cgi/hadoop/common/}
+  Apache Download Mirrors}}.
+
+* Prepare to Start the Hadoop Cluster
+
+  Unpack the downloaded Hadoop distribution. In the distribution, edit
+  the file <<<etc/hadoop/hadoop-env.sh>>> to define some parameters as
+  follows:
+
+----
+  # set to the root of your Java installation
+  export JAVA_HOME=/usr/java/latest
+
+  # Assuming your installation directory is /usr/local/hadoop
+  export HADOOP_PREFIX=/usr/local/hadoop
+----
+
+  Try the following command:
+
+----
+  $ bin/hadoop
+----
+
+  This will display the usage documentation for the hadoop script.
+
+  Now you are ready to start your Hadoop cluster in one of the three supported
+  modes:
+
+  * {{{Standalone Operation}Local (Standalone) Mode}}
+
+  * {{{Pseudo-Distributed Operation}Pseudo-Distributed Mode}}
+
+  * {{{Fully-Distributed Operation}Fully-Distributed Mode}}
+
+* Standalone Operation
+
+  By default, Hadoop is configured to run in a non-distributed mode, as a
+  single Java process. This is useful for debugging.
+
+  The following example copies the unpacked conf directory to use as input
+  and then finds and displays every match of the given regular expression.
+  Output is written to the given output directory.
+
+----
+  $ mkdir input
+  $ cp etc/hadoop/*.xml input
+  $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-${project.version}.jar grep input output 'dfs[a-z.]+'
+  $ cat output/*
+----
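+
+  Note that, as with any MapReduce job, the job will fail if its output
+  directory already exists. If you want to re-run the example above, first
+  remove the previous local output directory (the path assumed here is the
+  one used by the commands above):
+
+----
+  $ rm -r output
+----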
+
+* Pseudo-Distributed Operation
+
+  Hadoop can also be run on a single node in a pseudo-distributed mode where
+  each Hadoop daemon runs in a separate Java process.
+
+** Configuration
+
+  Use the following:
+
+  etc/hadoop/core-site.xml:
 
 +---+
-$ mvn clean install -DskipTests
-$ cd hadoop-mapreduce-project
-$ mvn clean install assembly:assembly -Pnative
-+---+
- <<NOTE:>> You will need {{{http://code.google.com/p/protobuf}protoc 2.5.0}}
- installed.
-
- To ignore the native builds in mapreduce you can omit the <<<-Pnative>>> argument
- for maven. The tarball should be available in <<>> directory.
-
-
-* Setting up the environment.
-
-  Assuming you have installed hadoop-common/hadoop-hdfs and exported
-  <<$HADOOP_COMMON_HOME>>/<<$HADOOP_HDFS_HOME>>, untar hadoop mapreduce
-  tarball and set environment variable <<$HADOOP_MAPRED_HOME>> to the
-  untarred directory. Set <<$HADOOP_YARN_HOME>> the same as <<$HADOOP_MAPRED_HOME>>.
-
-  <<NOTE:>> The following instructions assume you have hdfs running.
-
-* Setting up Configuration.
-
-  To start the ResourceManager and NodeManager, you will have to update the configs.
-  Assuming your $HADOOP_CONF_DIR is the configuration directory and has the installed
-  configs for HDFS and <<<core-site.xml>>>. There are 2 config files you will have to setup
-  <<<mapred-site.xml>>> and <<<yarn-site.xml>>>.
-
-** Setting up <<<mapred-site.xml>>>
-
-  Add the following configs to your <<<mapred-site.xml>>>.
-
-+---+
-  <property>
-    <name>mapreduce.cluster.temp.dir</name>
-    <value></value>
-    <description>No description</description>
-    <final>true</final>
-  </property>
-
-  <property>
-    <name>mapreduce.cluster.local.dir</name>
-    <value></value>
-    <description>No description</description>
-    <final>true</final>
-  </property>
+<configuration>
+    <property>
+        <name>fs.defaultFS</name>
+        <value>hdfs://localhost:9000</value>
+    </property>
+</configuration>
 +---+
-** Setting up <<<yarn-site.xml>>>
-
-Add the following configs to your <<<yarn-site.xml>>>
-
-+---+
-  <property>
-    <name>yarn.resourcemanager.resource-tracker.address</name>
-    <value>host:port</value>
-    <description>host is the hostname of the resource manager and
-    port is the port on which the NodeManagers contact the Resource Manager.
-    </description>
-  </property>
-
-  <property>
-    <name>yarn.resourcemanager.scheduler.address</name>
-    <value>host:port</value>
-    <description>host is the hostname of the resourcemanager and port is the port
-    on which the Applications in the cluster talk to the Resource Manager.
-    </description>
-  </property>
-
-  <property>
-    <name>yarn.resourcemanager.scheduler.class</name>
-    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
-    <description>In case you do not want to use the default scheduler</description>
-  </property>
-
-  <property>
-    <name>yarn.resourcemanager.address</name>
-    <value>host:port</value>
-    <description>the host is the hostname of the ResourceManager and the port is the port on
-    which the clients can talk to the Resource Manager.</description>
-  </property>
-
-  <property>
-    <name>yarn.nodemanager.local-dirs</name>
-    <value></value>
-    <description>the local directories used by the nodemanager</description>
-  </property>
-
-  <property>
-    <name>yarn.nodemanager.address</name>
-    <value>0.0.0.0:port</value>
-    <description>the nodemanagers bind to this port</description>
-  </property>
-
-  <property>
-    <name>yarn.nodemanager.resource.memory-mb</name>
-    <value>10240</value>
-    <description>the amount of memory on the NodeManager in GB</description>
-  </property>
-
-  <property>
-    <name>yarn.nodemanager.remote-app-log-dir</name>
-    <value>/app-logs</value>
-    <description>directory on hdfs where the application logs are moved to</description>
-  </property>
-
-  <property>
-    <name>yarn.nodemanager.log-dirs</name>
-    <value></value>
-    <description>the directories used by Nodemanagers as log directories</description>
-  </property>
-
-  <property>
-    <name>yarn.nodemanager.aux-services</name>
-    <value>mapreduce_shuffle</value>
-    <description>shuffle service that needs to be set for Map Reduce to run</description>
-  </property>
-+---+
-
-* Setting up <<<capacity-scheduler.xml>>>
-
-  Make sure you populate the root queues in <<<capacity-scheduler.xml>>>.
-
-+---+
-  <property>
-    <name>yarn.scheduler.capacity.root.queues</name>
-    <value>unfunded,default</value>
-  </property>
-
-  <property>
-    <name>yarn.scheduler.capacity.root.capacity</name>
-    <value>100</value>
-  </property>
-
-  <property>
-    <name>yarn.scheduler.capacity.root.unfunded.capacity</name>
-    <value>50</value>
-  </property>
-
-  <property>
-    <name>yarn.scheduler.capacity.root.default.capacity</name>
-    <value>50</value>
-  </property>
-+---+
-
-* Running daemons.
-
-  Assuming that the environment variables <<$HADOOP_COMMON_HOME>>, <<$HADOOP_HDFS_HOME>>, <<$HADOO_MAPRED_HOME>>,
-  <<$HADOOP_YARN_HOME>>, <<$JAVA_HOME>> and <<$HADOOP_CONF_DIR>> have been set appropriately.
-  Set $<<$YARN_CONF_DIR>> the same as $<<$HADOOP_CONF_DIR>>
-
-  Run ResourceManager and NodeManager as:
+
+  etc/hadoop/hdfs-site.xml:
 
 +---+
-$ cd $HADOOP_MAPRED_HOME
-$ sbin/yarn-daemon.sh start resourcemanager
-$ sbin/yarn-daemon.sh start nodemanager
+<configuration>
+    <property>
+        <name>dfs.replication</name>
+        <value>1</value>
+    </property>
+</configuration>
 +---+
 
- You should be up and running. You can run randomwriter as:
+** Set up passphraseless ssh
+
+  Now check that you can ssh to the localhost without a passphrase:
+
+----
+  $ ssh localhost
+----
+
+  If you cannot ssh to localhost without a passphrase, execute the
+  following commands:
+
+----
+  $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
+  $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
+----
+
+** Execution
+
+  The following instructions are for running a MapReduce job locally.
+  If you want to execute a job on YARN, see {{YARN on Single Node}}.
+
+  [[1]] Format the filesystem:
+
+----
+  $ bin/hdfs namenode -format
+----
+
+  [[2]] Start NameNode daemon and DataNode daemon:
+
+----
+  $ sbin/start-dfs.sh
+----
+
+        The hadoop daemon log output is written to the <<<${HADOOP_LOG_DIR}>>>
+        directory (defaults to <<<${HADOOP_HOME}/logs>>>).
+
+  [[3]] Browse the web interface for the NameNode; by default it is
+        available at:
+
+        * NameNode - <<<http://localhost:50070/>>>
+
+  [[4]] Make the HDFS directories required to execute MapReduce jobs:
+
+----
+  $ bin/hdfs dfs -mkdir /user
+  $ bin/hdfs dfs -mkdir /user/<username>
+----
+
+  [[5]] Copy the input files into the distributed filesystem:
+
+----
+  $ bin/hdfs dfs -put etc/hadoop input
+----
+
+  [[6]] Run some of the examples provided:
+
+----
+  $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-${project.version}.jar grep input output 'dfs[a-z.]+'
+----
+
+  [[7]] Examine the output files:
+
+        Copy the output files from the distributed filesystem to the local
+        filesystem and examine them:
+
+----
+  $ bin/hdfs dfs -get output output
+  $ cat output/*
+----
+
+        or
+
+        View the output files on the distributed filesystem:
+
+----
+  $ bin/hdfs dfs -cat output/*
+----
+
+  [[8]] When you're done, stop the daemons with:
+
+----
+  $ sbin/stop-dfs.sh
+----
+
+** YARN on Single Node
+
+  You can run a MapReduce job on YARN in a pseudo-distributed mode by setting
+  a few parameters and additionally running the ResourceManager daemon and
+  NodeManager daemon.
+
+  The following instructions assume that steps 1 through 4 of
+  {{{Execution}the above instructions}} have already been executed.
+
+  [[1]] Configure parameters as follows:
+
+        etc/hadoop/mapred-site.xml:
 
 +---+
-$ $HADOOP_COMMON_HOME/bin/hadoop jar hadoop-examples.jar randomwriter out
+<configuration>
+    <property>
+        <name>mapreduce.framework.name</name>
+        <value>yarn</value>
+    </property>
+</configuration>
 +---+
-Good luck.
+
+        etc/hadoop/yarn-site.xml:
+
++---+
+<configuration>
+    <property>
+        <name>yarn.nodemanager.aux-services</name>
+        <value>mapreduce_shuffle</value>
+    </property>
+</configuration>
++---+
+
+  [[2]] Start ResourceManager daemon and NodeManager daemon:
+
+----
+  $ sbin/start-yarn.sh
+----
+
+  [[3]] Browse the web interface for the ResourceManager; by default it is
+        available at:
+
+        * ResourceManager - <<<http://localhost:8088/>>>
+
+  [[4]] Run a MapReduce job (see the example after this list).
+
+  [[5]] When you're done, stop the daemons with:
+
+----
+  $ sbin/stop-yarn.sh
+----
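+
+  For step 4, one option is to reuse the example job from the
+  {{{Execution}Execution}} section; since <<<mapreduce.framework.name>>> is
+  now set to <<<yarn>>>, the job is submitted to YARN instead of being run
+  locally. The commands below assume the same input and output paths used
+  earlier, that the input files have not yet been copied into HDFS, and that
+  no output directory exists yet:
+
+----
+  $ bin/hdfs dfs -put etc/hadoop input
+  $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-${project.version}.jar grep input output 'dfs[a-z.]+'
+  $ bin/hdfs dfs -cat output/*
+----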
+
+* Fully-Distributed Operation
+
+  For information on setting up fully-distributed, non-trivial clusters,
+  see {{{./ClusterSetup.html}Cluster Setup}}.