This document describes how to configure Hadoop HTTP web-consoles to require user authentication.

By default Hadoop HTTP web-consoles (JobTracker, NameNode, TaskTrackers and DataNodes) allow access without any form of authentication.

Similarly to Hadoop RPC, Hadoop HTTP web-consoles can be configured to require Kerberos authentication using the HTTP SPNEGO protocol (supported by browsers such as Firefox and Internet Explorer).
In addition, Hadoop HTTP web-consoles support the equivalent of Hadoop's Pseudo/Simple authentication. If this option is enabled, users must specify their user name in the first browser interaction using the user.name query string parameter. For example: http://localhost:50030/jobtracker.jsp?user.name=babu.
If a custom authentication mechanism is required for the HTTP web-consoles, it is possible to implement a plugin to support the alternate authentication mechanism (refer to Hadoop hadoop-auth for details on writing an AuthenticatorHandler).

The next section describes how to configure Hadoop HTTP web-consoles to require user authentication.
The following properties should be in the core-site.xml of all the nodes in the cluster.

hadoop.http.filter.initializers: add the org.apache.hadoop.security.AuthenticationFilterInitializer initializer class to this property.
hadoop.http.authentication.type: Defines the authentication used for the HTTP web-consoles. The supported values are: simple | kerberos | #AUTHENTICATION_HANDLER_CLASSNAME#. The default value is simple.
hadoop.http.authentication.token.validity: Indicates how long (in seconds) an authentication token is valid before it has to be renewed. The default value is 36000.
hadoop.http.authentication.signature.secret.file: The signature secret file for signing the authentication tokens. If not set, a random secret is generated at startup time. The same secret should be used for all nodes in the cluster: JobTracker, NameNode, DataNodes and TaskTrackers. The default value is ${user.home}/hadoop-http-auth-signature-secret.
IMPORTANT: This file should be readable only by the Unix user running the daemons.
hadoop.http.authentication.cookie.domain: The domain to use for the HTTP cookie that stores the authentication token. For authentication to work correctly across all nodes in the cluster the domain must be correctly set. There is no default value; in that case the HTTP cookie will not have a domain and will work only with the hostname issuing it.
IMPORTANT: when using IP addresses, browsers ignore cookies with domain settings. For this setting to work properly all nodes in the cluster must be configured to generate URLs with hostname.domain names in them.
hadoop.http.authentication.simple.anonymous.allowed: Indicates if anonymous requests are allowed when using 'simple' authentication. The default value is true.
hadoop.http.authentication.kerberos.principal: Indicates the Kerberos principal to be used for the HTTP endpoint when using 'kerberos' authentication. The principal short name must be HTTP per the Kerberos HTTP SPNEGO specification. The default value is HTTP/_HOST@$LOCALHOST, where _HOST, if present, is replaced with the bind address of the HTTP server.
hadoop.http.authentication.kerberos.keytab: Location of the keytab file with the credentials for the Kerberos principal used for the HTTP endpoint. The default value is ${user.home}/hadoop.keytab.
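As an illustration, a minimal core-site.xml fragment enabling 'simple' HTTP authentication could look as follows (the secret file path and the anonymous setting are examples, not defaults to copy verbatim):

  <configuration>
    <property>
      <name>hadoop.http.filter.initializers</name>
      <value>org.apache.hadoop.security.AuthenticationFilterInitializer</value>
    </property>
    <property>
      <name>hadoop.http.authentication.type</name>
      <value>simple</value>
    </property>
    <property>
      <name>hadoop.http.authentication.token.validity</name>
      <value>36000</value>
    </property>
    <property>
      <name>hadoop.http.authentication.signature.secret.file</name>
      <value>/etc/hadoop/http-auth-signature-secret</value>
    </property>
    <property>
      <name>hadoop.http.authentication.simple.anonymous.allowed</name>
      <value>false</value>
    </property>
  </configuration>

For Kerberos, hadoop.http.authentication.type would instead be set to kerberos, together with the kerberos.principal and kerberos.keytab properties described above.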
This document describes how to install, configure and manage non-trivial Hadoop clusters ranging from a few nodes to extremely large clusters with thousands of nodes.

To play with Hadoop, you may first want to install Hadoop on a single machine (see Hadoop Quick Start).

Installing a Hadoop cluster typically involves unpacking the software on all the machines in the cluster.
Typically one machine in the cluster is designated as the NameNode and another machine as the JobTracker, exclusively. These are the masters. The rest of the machines in the cluster act as both DataNode and TaskTracker. These are the slaves.
The root of the distribution is referred to as HADOOP_PREFIX. All machines in the cluster usually have the same HADOOP_PREFIX path.
The following sections describe how to configure a Hadoop cluster.
Hadoop configuration is driven by two types of important configuration files: the read-only default configuration files and the site-specific configuration files (conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml). To learn more about how the Hadoop framework is controlled by these configuration files, see the Hadoop configuration documentation.
Additionally, you can control the Hadoop scripts found in the bin/ directory of the distribution by setting site-specific values via conf/hadoop-env.sh.

To configure the Hadoop cluster you will need to configure the environment in which the Hadoop daemons execute as well as the configuration parameters for the Hadoop daemons.
The Hadoop daemons are NameNode/DataNode and JobTracker/TaskTracker.

Administrators should use the conf/hadoop-env.sh script to do site-specific customization of the Hadoop daemons' process environment.

At the very least you should specify JAVA_HOME so that it is correctly defined on each remote node.
Administrators can configure individual daemons using the configuration options HADOOP_*_OPTS. The available options are shown in the table below.

Daemon | Configure Options |
---|---|
NameNode | HADOOP_NAMENODE_OPTS |
DataNode | HADOOP_DATANODE_OPTS |
SecondaryNamenode | HADOOP_SECONDARYNAMENODE_OPTS |
For example, to configure the NameNode to use the parallel garbage collector, the following statement should be added to hadoop-env.sh:

  export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"
Other useful configuration parameters that you can customize include:

HADOOP_LOG_DIR - The directory where the daemons' log files are stored. They are automatically created if they don't exist.

HADOOP_HEAPSIZE - The maximum heap size to use, in MB, e.g. 1000. This is used to configure the heap size for the Hadoop daemons. The default value is 1000 MB.
This section deals with important parameters to be specified in the following:

conf/core-site.xml:

Parameter | Value | Notes |
---|---|---|
fs.defaultFS | URI of NameNode. | hdfs://hostname/ |
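For example, a core-site.xml entry pointing clients at a NameNode host (the hostname is illustrative) would be:

  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com/</value>
  </property>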
conf/hdfs-site.xml:

Parameter | Value | Notes |
---|---|---|
dfs.namenode.name.dir | Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently. | If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy. |
dfs.datanode.data.dir | Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks. | If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. |
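A sketch of the corresponding hdfs-site.xml entries, assuming two local disks per node (the paths are placeholders):

  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/1/dfs/name,/data/2/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/1/dfs/data,/data/2/dfs/data</value>
  </property>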
conf/mapred-site.xml:

Parameter | Value | Notes |
---|---|---|
mapreduce.jobtracker.address | Host or IP and port of JobTracker. | host:port pair. |
mapreduce.jobtracker.system.dir | Path on HDFS where the Map/Reduce framework stores system files, e.g. /hadoop/mapred/system/. | This is in the default filesystem (HDFS) and must be accessible from both the server and client machines. |
mapreduce.cluster.local.dir | -- Comma-separated list of paths on the local filesystem where - temporary Map/Reduce data is written. - | -Multiple paths help spread disk i/o. | -
mapred.tasktracker.{map|reduce}.tasks.maximum | -
- The maximum number of Map/Reduce tasks, which are run
- simultaneously on a given TaskTracker , individually.
- |
- - Defaults to 2 (2 maps and 2 reduces), but vary it depending on - your hardware. - | -
dfs.hosts/dfs.hosts.exclude | -List of permitted/excluded DataNodes. | -- If necessary, use these files to control the list of allowable - datanodes. - | -
mapreduce.jobtracker.hosts.filename/mapreduce.jobtracker.hosts.exclude.filename | -List of permitted/excluded TaskTrackers. | -- If necessary, use these files to control the list of allowable - TaskTrackers. - | -
mapreduce.cluster.acls.enabled | -Boolean, specifying whether checks for queue ACLs and job ACLs - are to be done for authorizing users for doing queue operations and - job operations. - | -- If true, queue ACLs are checked while submitting - and administering jobs and job ACLs are checked for authorizing - view and modification of jobs. Queue ACLs are specified using the - configuration parameters of the form defined below under - mapred-queues.xml. Job ACLs are described at - mapred-tutorial in "Job Authorization" section. - For enabling this flag(mapreduce.cluster.acls.enabled), this is to be - set to true in mapred-site.xml on JobTracker node and on all - TaskTracker nodes. - | -
Typically all the above parameters are marked as final to ensure that they cannot be overridden by user applications.
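For instance, a mapred-site.xml fragment setting two of the above parameters and marking them final could look like this (host, port and paths are examples only):

  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>jobtracker.example.com:8021</value>
    <final>true</final>
  </property>
  <property>
    <name>mapreduce.cluster.local.dir</name>
    <value>/data/1/mapred/local,/data/2/mapred/local</value>
    <final>true</final>
  </property>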
conf/mapred-queues.xml:
This file is used to configure the queues in the Map/Reduce - system. Queues are abstract entities in the JobTracker that can be - used to manage collections of jobs. They provide a way for - administrators to organize jobs in specific ways and to enforce - certain policies on such collections, thus providing varying - levels of administrative control and management functions on jobs. -
One can imagine sample scenarios such as dedicating separate queues to different groups of users, or applying different policies to different classes of jobs.
-The usage of queues is closely tied to the scheduler configured - at the JobTracker via mapreduce.jobtracker.taskscheduler. - The degree of support of queues depends on the scheduler used. Some - schedulers support a single queue, while others support more complex - configurations. Schedulers also implement the policies that apply - to jobs in a queue. Some schedulers, such as the Fairshare scheduler, - implement their own mechanisms for collections of jobs and do not rely - on queues provided by the framework. The administrators are - encouraged to refer to the documentation of the scheduler they are - interested in for determining the level of support for queues.
-The Map/Reduce framework supports some basic operations on queues - such as job submission to a specific queue, access control for queues, - queue states, viewing configured queues and their properties - and refresh of queue properties. In order to fully implement some of - these operations, the framework takes the help of the configured - scheduler.
-The following types of queue configurations are possible:
Most of the configuration of the queues can be refreshed/reloaded without restarting the Map/Reduce sub-system by editing this configuration file, as described in the section on reloading queue configuration. Not all configuration properties can be reloaded, of course, as the description of each property below explains.
The format of conf/mapred-queues.xml is different from the other configuration files, supporting nested configuration elements to support hierarchical queues. The format is as follows:
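A minimal sketch of the file's structure, consistent with the tags described in the table below (the queue name, ACL values and the scheduler property are purely illustrative):

  <queues aclsEnabled="false">
    <queue>
      <name>default</name>
      <state>running</state>
      <acl-submit-job>user1,user2 group1,group2</acl-submit-job>
      <acl-administer-jobs>user1 group1</acl-administer-jobs>
      <properties>
        <property key="capacity" value="100"/>
      </properties>
    </queue>
  </queues>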
Tag/Attribute | Value | Refresh-able? | Notes |
---|---|---|---|
queues | Root element of the configuration file. | Not-applicable | All the queues are nested inside this root element of the file. There can be only one root queues element in the file. |
aclsEnabled | -Boolean attribute to the - <queues> tag - specifying whether ACLs are supported for controlling job - submission and administration for all the queues - configured. - | -Yes | -If false, ACLs are ignored for all the
- configured queues. - If true, the user and group details of the user - are checked against the configured ACLs of the corresponding - job-queue while submitting and administering jobs. ACLs can be - specified for each queue using the queue-specific tags - "acl-$acl_name", defined below. ACLs are checked only against - the job-queues, i.e. the leaf-level queues; ACLs configured - for the rest of the queues in the hierarchy are ignored. - |
-
queue | A child element of the <queues> tag or another <queue>. Denotes a queue in the system. | Not applicable | Queues can be hierarchical and so this element can contain children of this same type. |
name | -Child element of a - <queue> specifying the - name of the queue. | -No | -Name of the queue cannot contain the character ":" - which is reserved as the queue-name delimiter when addressing a - queue in a hierarchy. | -
state | -Child element of a - <queue> specifying the - state of the queue. - | -Yes | -Each queue has a corresponding state. A queue in
- 'running' state can accept new jobs, while a queue in
- 'stopped' state will stop accepting any new jobs. State
- is defined and respected by the framework only for the
- leaf-level queues and is ignored for all other queues.
The state of the queue can be viewed from the command line using the 'bin/mapred queue' command and also on the Web UI. Administrators can stop and start queues at runtime using the feature of reloading queue configuration. If a queue is stopped at runtime, it will complete all the existing running jobs and will stop accepting any new jobs. |
-
acl-submit-job | -Child element of a - <queue> specifying the - list of users and groups that can submit jobs to the specified - queue. | -Yes | -
- Applicable only to leaf-queues. - The list of users and groups are both comma separated - list of names. The two lists are separated by a blank. - Example: user1,user2 group1,group2. - If you wish to define only a list of groups, provide - a blank at the beginning of the value. - - |
-
acl-administer-jobs | -Child element of a - <queue> specifying the - list of users and groups that can view job details, change the - priority of a job or kill a job that has been submitted to the - specified queue. - | -Yes | -
- Applicable only to leaf-queues. - The list of users and groups are both comma separated - list of names. The two lists are separated by a blank. - Example: user1,user2 group1,group2. - If you wish to define only a list of groups, provide - a blank at the beginning of the value. Note that the - owner of a job can always change the priority or kill - his/her own job, irrespective of the ACLs. - |
-
properties | Child element of a <queue> specifying the scheduler specific properties. | Not applicable | The scheduler specific properties are the children of this element specified as a group of <property> tags described below. The JobTracker completely ignores these properties. These can be used as per-queue properties needed by the scheduler being configured. Please look at the scheduler specific documentation as to how these properties are used by that particular scheduler. |
property | Child element of <properties> for a specific queue. | Not applicable | A single scheduler specific queue-property. Ignored by the JobTracker and used by the scheduler that is configured. |
key | -Attribute of a - <property> for a - specific queue. | -Scheduler-specific | -The name of a single scheduler specific queue-property. | -
value | -Attribute of a - <property> for a - specific queue. | -Scheduler-specific | -The value of a single scheduler specific queue-property. - The value can be anything that is left for the proper - interpretation by the scheduler that is configured. | -
Once the queues are configured properly and the Map/Reduce - system is up and running, from the command line one can - get the list - of queues and - obtain - information specific to each queue. This information is also - available from the web UI. On the web UI, queue information can be - seen by going to queueinfo.jsp, linked to from the queues table-cell - in the cluster-summary table. The queueinfo.jsp prints the hierarchy - of queues as well as the specific information for each queue. -
Users can submit jobs only to a leaf-level queue by specifying the fully-qualified queue-name for the property name mapreduce.job.queuename in the job configuration. The character ':' is the queue-name delimiter; so, for example, if one wants to submit to a configured job-queue 'Queue-C' which is one of the sub-queues of 'Queue-B', which in turn is a sub-queue of 'Queue-A', then the job configuration should contain the property mapreduce.job.queuename set to the value Queue-A:Queue-B:Queue-C.
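Expressed as a job-configuration snippet, this would be:

  <property>
    <name>mapreduce.job.queuename</name>
    <value>Queue-A:Queue-B:Queue-C</value>
  </property>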
-This section lists some non-default configuration parameters which - have been used to run the sort benchmark on very large - clusters.
- -Some non-default configuration values used to run sort900, - that is 9TB of data sorted on a cluster with 900 nodes:
-Configuration File | -Parameter | -Value | -Notes | -
---|---|---|---|
conf/hdfs-site.xml | -dfs.blocksize | -128m | -- HDFS blocksize of 128 MB for large file-systems. Sizes can be provided - in size-prefixed values (10k, 128m, 1g, etc.) or simply in bytes (134217728 for 128 MB, etc.). - | -
conf/hdfs-site.xml | -dfs.namenode.handler.count | -40 | -- More NameNode server threads to handle RPCs from large - number of DataNodes. - | -
conf/mapred-site.xml | -mapreduce.reduce.shuffle.parallelcopies | -20 | -- Higher number of parallel copies run by reduces to fetch - outputs from very large number of maps. - | -
conf/mapred-site.xml | -mapreduce.map.java.opts | --Xmx512M | -- Larger heap-size for child jvms of maps. - | -
conf/mapred-site.xml | -mapreduce.reduce.java.opts | --Xmx512M | -- Larger heap-size for child jvms of reduces. - | -
conf/mapred-site.xml | -mapreduce.reduce.shuffle.input.buffer.percent | -0.80 | -- Larger amount of memory allocated for merging map output - in memory during the shuffle. Expressed as a fraction of - the total heap. - | -
conf/mapred-site.xml | -mapreduce.reduce.input.buffer.percent | -0.80 | -- Larger amount of memory allocated for retaining map output - in memory during the reduce. Expressed as a fraction of - the total heap. - | -
conf/mapred-site.xml | -mapreduce.task.io.sort.factor | -100 | -More streams merged at once while sorting files. | -
conf/mapred-site.xml | -mapreduce.task.io.sort.mb | -200 | -Higher memory-limit while sorting data. | -
conf/core-site.xml | -io.file.buffer.size | -131072 | -Size of read/write buffer used in SequenceFiles. | -
Updates to some configuration values to run sort1400 and - sort2000, that is 14TB of data sorted on 1400 nodes and 20TB of - data sorted on 2000 nodes:
-Configuration File | -Parameter | -Value | -Notes | -
---|---|---|---|
conf/mapred-site.xml | -mapreduce.jobtracker.handler.count | -60 | -- More JobTracker server threads to handle RPCs from large - number of TaskTrackers. - | -
conf/mapred-site.xml | -mapreduce.reduce.shuffle.parallelcopies | -50 | -- |
conf/mapred-site.xml | -mapreduce.tasktracker.http.threads | -50 | -- More worker threads for the TaskTracker's http server. The - http server is used by reduces to fetch intermediate - map-outputs. - | -
conf/mapred-site.xml | -mapreduce.map.java.opts | --Xmx512M | -- Larger heap-size for child jvms of maps. - | -
conf/mapred-site.xml | -mapreduce.reduce.java.opts | --Xmx1024M | -Larger heap-size for child jvms of reduces. | -
As MapReduce jobs can use varying amounts of memory, Hadoop provides various configuration options to users and administrators for managing memory effectively. Some of these options are job specific and can be used by users. While setting up a cluster, administrators can configure appropriate default values for these options so that users' jobs run out of the box. Other options are cluster specific and can be used by administrators to enforce limits and prevent misconfigured or memory-intensive jobs from causing undesired side effects on the cluster.

The values configured should take into account the hardware resources of the cluster, such as the amount of physical and virtual memory available for tasks, the number of slots configured on the slaves and the requirements for other processes running on the slaves. If the right values are not set, it is likely that jobs start failing with memory related errors or, in the worst case, even affect other tasks or the slaves themselves.
- -- Before describing the memory options, it is - useful to look at a feature provided by Hadoop to monitor - memory usage of MapReduce tasks it runs. The basic objective - of this feature is to prevent MapReduce tasks from consuming - memory beyond a limit that would result in their affecting - other processes running on the slave, including other tasks - and daemons like the DataNode or TaskTracker. -
- -- Note: For the time being, this feature is available - only for the Linux platform. -
- -- Hadoop allows monitoring to be done both for virtual - and physical memory usage of tasks. This monitoring - can be done independently of each other, and therefore the - options can be configured independently of each other. It - has been found in some environments, particularly related - to streaming, that virtual memory recorded for tasks is high - because of libraries loaded by the programs used to run - the tasks. However, this memory is largely unused and does - not affect the slaves's memory itself. In such cases, - monitoring based on physical memory can provide a more - accurate picture of memory usage. -
- -- This feature considers that there is a limit on - the amount of virtual or physical memory on the slaves - that can be used by - the running MapReduce tasks. The rest of the memory is - assumed to be required for the system and other processes. - Since some jobs may require higher amount of memory for their - tasks than others, Hadoop allows jobs to specify how much - memory they expect to use at a maximum. Then by using - resource aware scheduling and monitoring, Hadoop tries to - ensure that at any time, only enough tasks are running on - the slaves as can meet the dual constraints of an individual - job's memory requirements and the total amount of memory - available for all MapReduce tasks. -
- -- The TaskTracker monitors tasks in regular intervals. Each time, - it operates in two steps: -
- -- In either case, the task's diagnostic message will indicate the - reason why the task was terminated. -
- -- Resource aware scheduling can ensure that tasks are scheduled - on a slave only if their memory requirement can be satisfied - by the slave. The Capacity Scheduler, for example, - takes virtual memory requirements into account while - scheduling tasks, as described in the section on - - memory based scheduling. -
- -- Memory monitoring is enabled when certain configuration - variables are defined with non-zero values, as described below. -
Memory related options that can be configured individually per job are described in detail in the section on Configuring Memory Requirements For A Job in the MapReduce tutorial. While setting up the cluster, the Hadoop defaults for these options can be reviewed and changed to better suit the job profiles expected to be run on the clusters, as well as the hardware configuration.
-- As with any other configuration option in Hadoop, if the - administrators desire to prevent users from overriding these - options in jobs they submit, these values can be marked as - final in the cluster configuration. -
-- This section describes the memory related options that are - used by the JobTracker and TaskTrackers, and cannot be changed - by jobs. The values set for these options should be the same - for all the slave nodes in a cluster. -
mapreduce.cluster.{map|reduce}memory.mb: These options define the default amount of virtual memory that should be allocated for MapReduce tasks running in the cluster. They typically match the default values set for the options mapreduce.{map|reduce}.memory.mb. They help in the calculation of the total amount of virtual memory available for MapReduce tasks on a slave, which is mapreduce.cluster.mapmemory.mb multiplied by the number of map slots plus mapreduce.cluster.reducememory.mb multiplied by the number of reduce slots.

mapreduce.jobtracker.max{map|reduce}memory.mb: These options define the maximum amount of virtual memory that can be requested by jobs using the parameters mapreduce.{map|reduce}.memory.mb. The system will reject any job that is submitted requesting more memory than these limits. Typically, these options should be set to mapreduce.cluster.{map|reduce}memory.mb multiplied by the number of {map|reduce} slots on a slave. For example, if mapreduce.cluster.reducememory.mb is set to 2GB and there are 2 reduce slots configured on the slaves, the value for mapreduce.jobtracker.maxreducememory.mb should be set to 4096.

mapreduce.tasktracker.reserved.physicalmemory.mb: This option defines the amount of physical memory that is marked for system and daemon processes. The amount of physical memory available for MapReduce tasks is then the total physical memory on the slave minus this reserved amount.

mapreduce.tasktracker.taskmemorymanager.monitoringinterval: This option defines the time the TaskTracker waits between two cycles of memory monitoring. The value is specified in milliseconds.

Note: The virtual memory monitoring function is only enabled if the variables mapreduce.cluster.{map|reduce}memory.mb and mapreduce.jobtracker.max{map|reduce}memory.mb are set to values greater than zero. Likewise, the physical memory monitoring function is only enabled if the variable mapreduce.tasktracker.reserved.physicalmemory.mb is set to a value greater than zero.
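Putting these together, a mapred-site.xml sketch for slaves with 2 map and 2 reduce slots might look like the following (all values are illustrative and should be tuned to the actual hardware):

  <property>
    <name>mapreduce.cluster.mapmemory.mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>mapreduce.cluster.reducememory.mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.maxmapmemory.mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.maxreducememory.mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>mapreduce.tasktracker.reserved.physicalmemory.mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>mapreduce.tasktracker.taskmemorymanager.monitoringinterval</name>
    <value>5000</value>
  </property>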
Task controllers are classes in the Hadoop Map/Reduce framework that define how users' map and reduce tasks are launched and controlled. They can be used in clusters that require some customization in the process of launching or controlling the user tasks. For example, in some clusters, there may be a requirement to run tasks as the user who submitted the job, instead of as the task tracker user, which is how tasks are launched by default. This section describes how to configure and use task controllers.

The following task controllers are available in Hadoop.
-Name | Class Name | Description |
---|---|---|
DefaultTaskController | -org.apache.hadoop.mapred.DefaultTaskController | -The default task controller which Hadoop uses to manage task - execution. The tasks run as the task tracker user. | -
LinuxTaskController | -org.apache.hadoop.mapred.LinuxTaskController | -This task controller, which is supported only on Linux, - runs the tasks as the user who submitted the job. It requires - these user accounts to be created on the cluster nodes - where the tasks are launched. It - uses a setuid executable that is included in the Hadoop - distribution. The task tracker uses this executable to - launch and kill tasks. The setuid executable switches to - the user who has submitted the job and launches or kills - the tasks. For maximum security, this task controller - sets up restricted permissions and user/group ownership of - local files and directories used by the tasks such as the - job jar files, intermediate files, task log files and distributed - cache files. Particularly note that, because of this, except the - job owner and tasktracker, no other user can access any of the - local files/directories including those localized as part of the - distributed cache. - | -
The task controller to be used can be configured by setting the - value of the following key in mapred-site.xml
-Property | Value | Notes | -
---|---|---|
mapreduce.tasktracker.taskcontroller | -Fully qualified class name of the task controller class | -Currently there are two implementations of task controller - in the Hadoop system, DefaultTaskController and LinuxTaskController. - Refer to the class names mentioned above to determine the value - to set for the class of choice. - | -
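For example, to select the LinuxTaskController, mapred-site.xml would contain:

  <property>
    <name>mapreduce.tasktracker.taskcontroller</name>
    <value>org.apache.hadoop.mapred.LinuxTaskController</value>
  </property>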
This section of the document describes the steps required to - use the LinuxTaskController.
- -In order to use the LinuxTaskController, a setuid executable - should be built and deployed on the compute nodes. The - executable is named task-controller. To build the executable, - execute - ant task-controller -Dhadoop.conf.dir=/path/to/conf/dir. - - The path passed in -Dhadoop.conf.dir should be the path - on the cluster nodes where a configuration file for the setuid - executable would be located. The executable would be built to - build.dir/dist.dir/bin and should be installed to - $HADOOP_PREFIX/bin. -
- -- The executable must have specific permissions as follows. The - executable should have 6050 or --Sr-s--- permissions - user-owned by root(super-user) and group-owned by a special group - of which the TaskTracker's user is the group member and no job - submitter is. If any job submitter belongs to this special group, - security will be compromised. This special group name should be - specified for the configuration property - "mapreduce.tasktracker.group" in both mapred-site.xml and - task-controller.cfg. - For example, let's say that the TaskTracker is run as user - mapred who is part of the groups users and - specialGroup any of them being the primary group. - Let also be that users has both mapred and - another user (job submitter) X as its members, and X does - not belong to specialGroup. Going by the above - description, the setuid/setgid executable should be set - 6050 or --Sr-s--- with user-owner as mapred and - group-owner as specialGroup which has - mapred as its member(and not users which has - X also as its member besides mapred). -
- -- The LinuxTaskController requires that paths including and leading up - to the directories specified in - mapreduce.cluster.local.dir and hadoop.log.dir to - be set 755 permissions. -
- -The executable requires a configuration file called - taskcontroller.cfg to be - present in the configuration directory passed to the ant target - mentioned above. If the binary was not built with a specific - conf directory, the path defaults to - /path-to-binary/../conf. The configuration file must be - owned by the user running TaskTracker (user mapred in the - above example), group-owned by anyone and should have the - permissions 0400 or r--------. -
- -The executable requires following configuration items to be - present in the taskcontroller.cfg file. The items should - be mentioned as simple key=value pairs. -
-Name | Description |
---|---|
mapreduce.cluster.local.dir | -Path to mapreduce.cluster.local.directories. Should be same as the value - which was provided to key in mapred-site.xml. This is required to - validate paths passed to the setuid executable in order to prevent - arbitrary paths being passed to it. | -
hadoop.log.dir | -Path to hadoop log directory. Should be same as the value which - the TaskTracker is started with. This is required to set proper - permissions on the log files so that they can be written to by the user's - tasks and read by the TaskTracker for serving on the web UI. | -
mapreduce.tasktracker.group | -Group to which the TaskTracker belongs. The group owner of the - taskcontroller binary should be this group. Should be same as - the value with which the TaskTracker is configured. This - configuration is required for validating the secure access of the - task-controller binary. | -
Hadoop Map/Reduce provides a mechanism by which administrators - can configure the TaskTracker to run an administrator supplied - script periodically to determine if a node is healthy or not. - Administrators can determine if the node is in a healthy state - by performing any checks of their choice in the script. If the - script detects the node to be in an unhealthy state, it must print - a line to standard output beginning with the string ERROR. - The TaskTracker spawns the script periodically and checks its - output. If the script's output contains the string ERROR, - as described above, the node's status is reported as 'unhealthy' - and the node is black-listed on the JobTracker. No further tasks - will be assigned to this node. However, the - TaskTracker continues to run the script, so that if the node - becomes healthy again, it will be removed from the blacklisted - nodes on the JobTracker automatically. The node's health - along with the output of the script, if it is unhealthy, is - available to the administrator in the JobTracker's web interface. - The time since the node was healthy is also displayed on the - web interface. -
- -The following parameters can be used to control the node health - monitoring script in mapred-site.xml.
-Name | Description |
---|---|
mapreduce.tasktracker.healthchecker.script.path |
- Absolute path to the script which is periodically run by the - TaskTracker to determine if the node is - healthy or not. The file should be executable by the TaskTracker. - If the value of this key is empty or the file does - not exist or is not executable, node health monitoring - is not started. | -
mapreduce.tasktracker.healthchecker.interval |
- Frequency at which the node health script is run, - in milliseconds | -
mapreduce.tasktracker.healthchecker.script.timeout |
Time after which the node health script will be killed by the TaskTracker if unresponsive. The node is marked unhealthy if the node health script times out. |
mapreduce.tasktracker.healthchecker.script.args |
- Extra arguments that can be passed to the node health script - when launched. - These should be comma separated list of arguments. | -
Typically you choose one machine in the cluster to act as the NameNode and one machine to act as the JobTracker, exclusively. The rest of the machines act as both a DataNode and TaskTracker and are referred to as slaves.
List all slave hostnames or IP addresses in your conf/slaves file, one per line.
Hadoop uses the Apache log4j library via the Apache Commons Logging framework for logging. Edit the conf/log4j.properties file to customize the Hadoop daemons' logging configuration (log formats and so on).
The job history files are stored in the central location mapreduce.jobtracker.jobhistory.location, which can also be on DFS, and whose default value is ${HADOOP_LOG_DIR}/history. The history web UI is accessible from the JobTracker web UI.
The history files are also logged to the user-specified directory mapreduce.job.userhistorylocation, which defaults to the job output directory. The files are stored in "_logs/history/" in the specified directory. Hence, by default they will be in "mapreduce.output.fileoutputformat.outputdir/_logs/history/". Users can stop logging by giving the value none for mapreduce.job.userhistorylocation.
Users can view the history logs summary in the specified directory using the following command:
  $ bin/hadoop job -history output-dir
This command will print job details, and failed and killed tip details. More details about the job, such as successful tasks and task attempts made for each task, can be viewed using the following command:
  $ bin/hadoop job -history all output-dir
Once all the necessary configuration is complete, distribute the files to the HADOOP_CONF_DIR directory on all the machines, typically ${HADOOP_PREFIX}/conf.
The JobTracker can recover running jobs after a restart if mapreduce.jobtracker.restart.recover is set to true and JobHistory logging is enabled. Also, mapreduce.jobtracker.jobhistory.block.size should be set to an optimal value to dump job history to disk as soon as possible; the typical value is 3145728 (3MB).
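A sketch of the corresponding mapred-site.xml entries:

  <property>
    <name>mapreduce.jobtracker.restart.recover</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.jobtracker.jobhistory.block.size</name>
    <value>3145728</value>
  </property>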
Both HDFS and Map/Reduce components are rack-aware. HDFS block placement will use rack awareness for fault tolerance by placing one block replica on a different rack. This provides data availability in the event of a network switch failure within the cluster. The JobTracker uses rack awareness to reduce network transfers of HDFS data blocks by attempting to schedule tasks on datanodes with a local copy of needed HDFS blocks. If the tasks cannot be scheduled on the datanodes containing the needed HDFS blocks, then the tasks will be scheduled on the same rack to reduce network transfers if possible.
-The NameNode and the JobTracker obtain the rack id of the cluster slaves by invoking either - an external script or java class as specified by configuration files. Using either the - java class or external script for topology, output must adhere to the java - DNSToSwitchMapping - interface. The interface expects a one-to-one correspondence to be maintained - and the topology information in the format of '/myrack/myhost', where '/' is the topology - delimiter, 'myrack' is the rack identifier, and 'myhost' is the individual host. Assuming - a single /24 subnet per rack, one could use the format of '/192.168.100.0/192.168.100.5' as a - unique rack-host topology mapping. -
To use the java class for topology mapping, the class name is specified by the 'topology.node.switch.mapping.impl' parameter in the configuration file. An example, NetworkTopology.java, is included with the Hadoop distribution and can be customized by the Hadoop administrator. If not included with your distribution, NetworkTopology.java can also be found in the Hadoop subversion tree. Using a java class instead of an external script has a slight performance benefit in that it doesn't need to fork an external process when a new slave node registers itself with the JobTracker or NameNode. As this class is only used during slave node registration, the performance benefit is limited.
If implementing an external script, it is specified with the topology.script.file.name parameter in the configuration files. Unlike the java class, the external topology script is not included with the Hadoop distribution and is provided by the administrator. Hadoop will send multiple IP addresses to ARGV when forking the topology script. The number of IP addresses sent to the topology script is controlled with net.topology.script.number.args and defaults to 100. If net.topology.script.number.args were changed to 1, a topology script would get forked for each IP submitted by DataNodes and/or TaskTrackers. A configuration sketch wiring in such a script is shown below.
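Assuming the administrator's script lives at /etc/hadoop/topology.sh (a hypothetical path), the relevant configuration entries could be:

  <property>
    <name>topology.script.file.name</name>
    <value>/etc/hadoop/topology.sh</value>
  </property>
  <property>
    <name>net.topology.script.number.args</name>
    <value>100</value>
  </property>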
If topology.script.file.name or topology.node.switch.mapping.impl is not set, the rack id '/default-rack' is returned for any passed IP address. While this behavior appears desirable, it can cause issues with HDFS block replication, as the default behavior is to write one replicated block off rack, which is not possible when there is only a single rack named '/default-rack'.
An additional configuration setting is mapred.cache.task.levels, which determines the number of levels (in the network topology) of caches. So, for example, if it is the default value of 2, two levels of caches will be constructed: one for hosts (host -> task mapping) and another for racks (rack -> task mapping), giving us the one-to-one mapping of '/myrack/myhost'.
To start a Hadoop cluster you will need to start both the HDFS and Map/Reduce clusters.
Format a new distributed filesystem:
  $ bin/hadoop namenode -format

Start HDFS with the following command, run on the designated NameNode:
  $ bin/start-dfs.sh

The bin/start-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and starts the DataNode daemon on all the listed slaves.
Start Map/Reduce with the following command, run on the designated JobTracker:
  $ bin/start-mapred.sh

The bin/start-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and starts the TaskTracker daemon on all the listed slaves.
Stop HDFS with the following command, run on the designated NameNode:
  $ bin/stop-dfs.sh

The bin/stop-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and stops the DataNode daemon on all the listed slaves.
Stop Map/Reduce with the following command, run on the designated JobTracker:
  $ bin/stop-mapred.sh

The bin/stop-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and stops the TaskTracker daemon on all the listed slaves.
All Hadoop commands are invoked by the bin/hadoop script. Running the Hadoop script without any arguments prints the description for all commands.
-
- Usage: hadoop [--config confdir] [COMMAND] [GENERIC_OPTIONS] [COMMAND_OPTIONS]
-
- Hadoop has an option parsing framework that employs parsing generic options as well as running classes. -
-COMMAND_OPTION | Description |
---|---|
--config confdir |
- Overwrites the default Configuration directory. Default is ${HADOOP_PREFIX}/conf. | -
GENERIC_OPTIONS |
- The common set of options supported by multiple commands. | -
COMMAND COMMAND_OPTIONS |
- Various commands with their options are described in the following sections. The commands - have been grouped into User Commands - and Administration Commands. | -
- The following options are supported by dfsadmin, - fs, fsck, - job and fetchdt. - Applications should implement - Tool to support - - GenericOptions. -
-GENERIC_OPTION | Description |
---|---|
-conf <configuration file> |
- Specify an application configuration file. | -
-D <property=value> |
- Use value for given property. | -
-fs <local|namenode:port> |
- Specify a namenode. | -
-jt <local|jobtracker:port> |
- Specify a job tracker. Applies only to job. | -
-files <comma separated list of files> |
- Specify comma separated files to be copied to the map reduce cluster. - Applies only to job. | -
-libjars <comma separated list of jars> |
- Specify comma separated jar files to include in the classpath. - Applies only to job. | -
-archives <comma separated list of archives> |
- Specify comma separated archives to be unarchived on the compute machines. - Applies only to job. | -
Commands useful for users of a Hadoop cluster.
-- Creates a Hadoop archive. More information see the Hadoop Archives Guide. -
-
- Usage: hadoop archive -archiveName NAME <src>* <dest>
-
COMMAND_OPTION | Description |
---|---|
-archiveName NAME |
- Name of the archive to be created. | -
src |
- Filesystem pathnames which work as usual with regular expressions. | -
dest |
- Destination directory which would contain the archive. | -
Copies files or directories recursively. More information can be found in the DistCp Guide.
-
- Usage: hadoop distcp <srcurl> <desturl>
-
COMMAND_OPTION | Description |
---|---|
srcurl |
- Source Url | -
desturl |
- Destination Url | -
- Runs a generic filesystem user client. -
-
- Usage: hadoop fs [
GENERIC_OPTIONS]
- [COMMAND_OPTIONS]
-
- The various COMMAND_OPTIONS can be found at - File System Shell Guide. -
Runs an HDFS filesystem checking utility. See Fsck for more info.
-Usage: hadoop fsck [
GENERIC_OPTIONS]
- <path> [-move | -delete | -openforwrite] [-files [-blocks
- [-locations | -racks]]]
COMMAND_OPTION | Description |
---|---|
<path> |
- Start checking from this path. | -
-move |
- Move corrupted files to /lost+found | -
-delete |
- Delete corrupted files. | -
-openforwrite |
- Print out files opened for write. | -
-files |
- Print out files being checked. | -
-blocks |
- Print out block report. | -
-locations |
- Print out locations for every block. | -
-racks |
- Print out network topology for data-node locations. | -
- Gets Delegation Token from a NameNode. See fetchdt for more info. -
-Usage: hadoop fetchdt [
GENERIC_OPTIONS]
- [--webservice <namenode_http_addr>] <file_name>
COMMAND_OPTION | Description |
---|---|
<file_name> |
- File name to store the token into. | -
--webservice <https_address> |
- use http protocol instead of RPC | -
- Runs a jar file. Users can bundle their Map Reduce code in a jar file and execute it using this command. -
-
- Usage: hadoop jar <jar> [mainClass] args...
-
- The streaming jobs are run via this command. For examples, see - Hadoop Streaming. -
-- The WordCount example is also run using jar command. For examples, see the - MapReduce Tutorial. -
-- Command to interact with Map Reduce Jobs. -
-
- Usage: hadoop job [
GENERIC_OPTIONS]
- [-submit <job-file>] | [-status <job-id>] |
- [-counter <job-id> <group-name> <counter-name>] | [-kill <job-id>] |
- [-events <job-id> <from-event-#> <#-of-events>] | [-history [all] <historyFile>] |
- [-list [all]] | [-kill-task <task-id>] | [-fail-task <task-id>] |
- [-set-priority <job-id> <priority>]
-
COMMAND_OPTION | Description |
---|---|
-submit <job-file> |
- Submits the job. | -
-status <job-id> |
- Prints the map and reduce completion percentage and all job counters. | -
-counter <job-id> <group-name> <counter-name> |
- Prints the counter value. | -
-kill <job-id> |
- Kills the job. | -
-events <job-id> <from-event-#> <#-of-events> |
- Prints the events' details received by jobtracker for the given range. | -
-history [all] <historyFile> |
- -history <historyFile> prints job details, failed and killed tip details. More details - about the job such as successful tasks and task attempts made for each task can be viewed by - specifying the [all] option. | -
-list [all] |
- -list all displays all jobs. -list displays only jobs which are yet to complete. | -
-kill-task <task-id> |
- Kills the task. Killed tasks are NOT counted against failed attempts. | -
-fail-task <task-id> |
- Fails the task. Failed tasks are counted against failed attempts. | -
-set-priority <job-id> <priority> |
- Changes the priority of the job. - Allowed priority values are VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW | -
- Runs a pipes job. -
-
- Usage: hadoop pipes [-conf <path>] [-jobconf <key=value>, <key=value>, ...]
- [-input <path>] [-output <path>] [-jar <jar file>] [-inputformat <class>]
- [-map <class>] [-partitioner <class>] [-reduce <class>] [-writer <class>]
- [-program <executable>] [-reduces <num>]
-
COMMAND_OPTION | Description |
---|---|
-conf <path> |
- Configuration for job | -
-jobconf <key=value>, <key=value>, ... |
- Add/override configuration for job | -
-input <path> |
- Input directory | -
-output <path> |
- Output directory | -
-jar <jar file> |
- Jar filename | -
-inputformat <class> |
- InputFormat class | -
-map <class> |
- Java Map class | -
-partitioner <class> |
- Java Partitioner | -
-reduce <class> |
- Java Reduce class | -
-writer <class> |
- Java RecordWriter | -
-program <executable> |
- Executable URI | -
-reduces <num> |
- Number of reduces | -
Command to interact with and view Job Queue information.

Usage: hadoop queue [-list] | [-info <job-queue-name> [-showJobs]] | [-showacls]
-
COMMAND_OPTION | Description | -
---|---|
-list |
Gets the list of Job Queues configured in the system, along with the scheduling information associated with the job queues. |
-info <job-queue-name> [-showJobs] |
Displays the job queue information and associated scheduling information of the particular job queue. If the -showJobs option is present, a list of jobs submitted to the particular job queue is displayed. |
-showacls |
- Displays the queue name and associated queue operations allowed for the current user. - The list consists of only those queues to which the user has access. - | -
- Prints the version. -
-
- Usage: hadoop version
-
The hadoop script can be used to invoke any class.

Runs the class named CLASSNAME.
- -
- Usage: hadoop CLASSNAME
-
Commands useful for administrators of a Hadoop cluster.
-- Runs a cluster balancing utility. An administrator can simply press Ctrl-C to stop the - rebalancing process. For more details see - Rebalancer. -
-
- Usage: hadoop balancer [-policy <blockpool|datanode>] [-threshold <threshold>]
-
COMMAND_OPTION | Description |
---|---|
-policy <blockpool|datanode> |
- The balancing policy.
- datanode : Cluster is balance if the disk usage of each datanode is balance.
- blockpool : Cluster is balance if the disk usage of each block pool in each datanode is balance.
- Note that blockpool is a condition stronger than datanode .
- The default policy is datanode .
- |
-
-threshold <threshold> |
Percentage of disk capacity. The default threshold is 10%. |
- Get/Set the log level for each daemon. -
-
- Usage: hadoop daemonlog -getlevel <host:port> <name>
- Usage: hadoop daemonlog -setlevel <host:port> <name> <level>
-
COMMAND_OPTION | Description |
---|---|
-getlevel <host:port> <name> |
- Prints the log level of the daemon running at <host:port>. - This command internally connects to http://<host:port>/logLevel?log=<name> | -
-setlevel <host:port> <name> <level> |
- Sets the log level of the daemon running at <host:port>. - This command internally connects to http://<host:port>/logLevel?log=<name> | -
Runs an HDFS datanode.
-
- Usage: hadoop datanode [-rollback]
-
COMMAND_OPTION | Description |
---|---|
-rollback |
Rolls back the datanode to the previous version. This should be used after stopping the datanode and distributing the old Hadoop version. |
Runs an HDFS dfsadmin client.
-
- Usage: hadoop dfsadmin [
GENERIC_OPTIONS] [-report] [-safemode enter | leave | get | wait] [-refreshNodes]
- [-finalizeUpgrade] [-upgradeProgress status | details | force] [-metasave filename]
- [-setQuota <quota> <dirname>...<dirname>] [-clrQuota <dirname>...<dirname>]
- [-restoreFailedStorage true|false|check]
- [-help [cmd]]
-
COMMAND_OPTION | Description |
---|---|
-report |
- Reports basic filesystem information and statistics. | -
-safemode enter | leave | get | wait |
- Safe mode maintenance command.
- Safe mode is a Namenode state in which it - 1. does not accept changes to the name space (read-only) - 2. does not replicate or delete blocks. - Safe mode is entered automatically at Namenode startup, and - leaves safe mode automatically when the configured minimum - percentage of blocks satisfies the minimum replication - condition. Safe mode can also be entered manually, but then - it can only be turned off manually as well. |
-
-refreshNodes |
- Re-read the hosts and exclude files to update the set - of Datanodes that are allowed to connect to the Namenode - and those that should be decommissioned or recommissioned. | -
-finalizeUpgrade |
- Finalize upgrade of HDFS. - Datanodes delete their previous version working directories, - followed by Namenode doing the same. - This completes the upgrade process. | -
-printTopology |
- Print a tree of the rack/datanode topology of the - cluster as seen by the NameNode. | -
-upgradeProgress status | details | force |
- Request current distributed upgrade status, - a detailed status or force the upgrade to proceed. | -
-metasave filename |
- Save Namenode's primary data structures
- to <filename> in the directory specified by hadoop.log.dir property.
<filename> will contain one line for each of the following: 1. Datanodes heart beating with Namenode; 2. Blocks waiting to be replicated; 3. Blocks currently being replicated; 4. Blocks waiting to be deleted. |
-
-setQuota <quota> <dirname>...<dirname> |
- Set the quota <quota> for each directory <dirname>.
- The directory quota is a long integer that puts a hard limit on the number of names in the directory tree. - Best effort for the directory, with faults reported if - 1. N is not a positive integer, or - 2. user is not an administrator, or - 3. the directory does not exist or is a file, or - 4. the directory would immediately exceed the new quota. |
-
-clrQuota <dirname>...<dirname> |
Clear the quota for each directory <dirname>. Best effort for the directory, with fault reported if 1. the directory does not exist or is a file, or 2. the user is not an administrator. It does not fault if the directory has no quota. |
-
-restoreFailedStorage true | false | check |
- This option will turn on/off automatic attempt to restore failed storage replicas. - If a failed storage becomes available again the system will attempt to restore - edits and/or fsimage during checkpoint. 'check' option will return current setting. | -
-help [cmd] |
- Displays help for the given command or all commands if none - is specified. | -
Runs the MR admin client.

Usage: hadoop mradmin [GENERIC_OPTIONS] [-refreshServiceAcl] [-refreshQueues] [-refreshNodes] [-help [cmd]]
COMMAND_OPTION | Description | -
---|---|
-refreshServiceAcl |
- Reload the service-level authorization policies. Jobtracker - will reload the authorization policy file. | -
-refreshQueues |
Reload the queues' configuration at the JobTracker. Most of the configuration of the queues can be refreshed/reloaded without restarting the Map/Reduce sub-system. Administrators typically own the conf/mapred-queues.xml file, can edit it while the JobTracker is still running, and can do a reload by running this command.
It should be noted that while trying to refresh queues' configuration, one cannot change the hierarchy of queues itself. This means no operation that involves a change in either the hierarchy structure itself or the queues' names will be allowed. Only selected properties of queues can be changed during refresh. For example, new queues cannot be added dynamically, neither can an existing queue be deleted.
If during a reload of queue configuration a syntactic or semantic error is made while editing the configuration file, the refresh command fails with an exception that is printed on the standard output of this command, thus informing the requester with any helpful messages of what has gone wrong during the edit/reload. Importantly, the existing queue configuration is untouched and the system is left in a consistent state.
As described in the conf/mapred-queues.xml section, the <properties> tag in the queue configuration file can also be used to specify per-queue properties needed by the scheduler. When the framework's queue configuration is reloaded using this command, this scheduler specific configuration will also be reloaded, provided the scheduler being configured supports this reload. Please see the documentation of the particular scheduler in use. |
-
-refreshNodes |
- Refresh the hosts information at the jobtracker. | -
-help [cmd] |
- Displays help for the given command or all commands if none - is specified. | -
Runs the MapReduce JobTracker node.
-
- Usage: hadoop jobtracker [-dumpConfiguration]
-
COMMAND_OPTION | Description | -
---|---|
-dumpConfiguration |
Dumps the configuration used by the JobTracker, along with the queue configuration, in JSON format to standard output, and then exits. |
- Runs the namenode. For more information about upgrade, rollback and finalize see - Upgrade and Rollback. -
-
- Usage: hadoop namenode [-format [-force] [-nonInteractive] [-clusterid someid]] | [-upgrade] | [-rollback] | [-finalize] | [-importCheckpoint] | [-checkpoint] | [-backup]
-
COMMAND_OPTION | Description |
---|---|
-regular |
- Start namenode in standard, active role rather than as backup or checkpoint node. This is the default role. | -
-checkpoint |
- Start namenode in checkpoint role, creating periodic checkpoints of the active namenode metadata. | -
-backup |
- Start namenode in backup role, maintaining an up-to-date in-memory copy of the namespace and creating periodic checkpoints. | -
-format [-force] [-nonInteractive] [-clusterid someid] |
- Formats the namenode. It starts the namenode, formats it and then shuts it down. User will be prompted before formatting any non empty name directories in the local filesystem. - -nonInteractive: User will not be prompted for input if non empty name directories exist in the local filesystem and the format will fail. - -force: Formats the namenode and the user will NOT be prompted to confirm formatting of the name directories in the local filesystem. If -nonInteractive option is specified it will be ignored. - -clusterid: Associates the namenode with the id specified. When formatting federated namenodes use this option to make sure all namenodes are associated with the same id. |
-
-upgrade |
- Namenode should be started with upgrade option after the distribution of new Hadoop version. | -
-rollback |
Rolls back the namenode to the previous version. This should be used after stopping the cluster and distributing the old Hadoop version. |
-finalize |
Finalize will remove the previous state of the file system. The most recent upgrade will become permanent, and the rollback option will no longer be available. After finalization it shuts the namenode down. |
-importCheckpoint |
- Loads image from a checkpoint directory and saves it into the current one. Checkpoint directory - is read from property dfs.namenode.checkpoint.dir - (see Import Checkpoint). - | -
-checkpoint |
- Enables checkpointing - (see Checkpoint Node). | -
-backup |
- Enables checkpointing and maintains an in-memory, up-to-date copy of the file system namespace - (see Backup Node). | -
Runs the HDFS secondary namenode. See Secondary NameNode for more info.
-
- Usage: hadoop secondarynamenode [-checkpoint [force]] | [-geteditsize]
-
COMMAND_OPTION | Description |
---|---|
-checkpoint [force] | Checkpoints the secondary namenode if the EditLog size is >= dfs.namenode.checkpoint.size. If force is used, a checkpoint is performed irrespective of the EditLog size. |
-geteditsize | Prints the EditLog size. |
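For example, to print the current EditLog size, or to force an immediate checkpoint regardless of that size, using only the options shown above:
hadoop secondarynamenode -geteditsize
hadoop secondarynamenode -checkpoint force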
Runs a MapReduce TaskTracker node.
-
- Usage: hadoop tasktracker
-
The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports, such as Local FS, HFTP FS, S3 FS, and others. The FS shell is invoked by:
hdfs dfs <args>
All FS shell commands take path URIs as arguments. The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and for the Local FS the scheme is file. The scheme and authority are optional. If not specified, the default scheme specified in the configuration is used. An HDFS file or directory such as /parent/child can be specified as hdfs://namenodehost/parent/child or simply as /parent/child (given that your configuration is set to point to hdfs://namenodehost).
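For instance, assuming the default filesystem in the configuration points to hdfs://namenodehost, the following two commands refer to the same directory:
hdfs dfs -ls hdfs://namenodehost/parent/child
hdfs dfs -ls /parent/child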
-- Most of the commands in FS shell behave like corresponding Unix - commands. Differences are described with each of the - commands. Error information is sent to stderr and the - output is sent to stdout. -
- Usage: hdfs dfs -cat URI [URI …]
-
- Copies source paths to stdout. -
-Example:
- hdfs dfs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
-
- hdfs dfs -cat file:///file3 /user/hadoop/file4
- Exit Code:
- Returns 0 on success and -1 on error.
- Usage: hdfs dfs -chgrp [-R] GROUP URI [URI …]
-
Change group association of files. With -R, make the change recursively through the directory structure. The user must be the owner of the files, or else a super-user. Additional information is in the HDFS Permissions Guide.
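For example, to recursively assign a directory tree to a different group (the group name hadoopgroup and the path are placeholders):
hdfs dfs -chgrp -R hadoopgroup /user/hadoop/dir1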
-
- Usage: hdfs dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI …]
-
Change the permissions of files. With -R, make the change recursively through the directory structure. The user must be the owner of the file, or else a super-user. Additional information is in the HDFS Permissions Guide.
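For example, to recursively make a directory tree world-readable while keeping write access for the owner (the mode 755 and the path are illustrative values):
hdfs dfs -chmod -R 755 /user/hadoop/dir1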
-
Usage: hdfs dfs -chown [-R] [OWNER][:[GROUP]] URI [URI …]
-
Change the owner of files. With -R, make the change recursively through the directory structure. The user must be a super-user. Additional information is in the HDFS Permissions Guide.
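For example, to recursively hand a directory tree over to another user and group (newowner, newgroup and the path are placeholders):
hdfs dfs -chown -R newowner:newgroup /user/hadoop/dir1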
-
- Usage: hdfs dfs -copyFromLocal <localsrc> URI
-
Similar to the put command, except that the source is restricted to a local file reference.
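For example, mirroring the put examples further below (the file names are placeholders):
hdfs dfs -copyFromLocal localfile /user/hadoop/hadoopfile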
-
- Usage: hdfs dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
-
Similar to the get command, except that the destination is restricted to a local file reference.
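For example, mirroring the get examples further below (the file names are placeholders):
hdfs dfs -copyToLocal /user/hadoop/file localfile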
-
- Usage: hdfs dfs -count [-q] <paths>
-
- Count the number of directories, files and bytes under the paths that match the specified file pattern.
The output columns with -count are:
DIR_COUNT, FILE_COUNT, CONTENT_SIZE, FILE_NAME
The output columns with -count -q are:
QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, FILE_NAME
-
Example:
- hdfs dfs -count hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
-
- hdfs dfs -count -q hdfs://nn1.example.com/file1
-
- Exit Code:
-
- Returns 0 on success and -1 on error.
-
- Usage: hdfs dfs -cp URI [URI …] <dest>
-
Copy files from source to destination. This command allows multiple sources as well, in which case the destination must be a directory.
-
- Example:
hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2
- hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir
- Exit Code:
-
- Returns 0 on success and -1 on error.
-
- Usage: hdfs dfs -du [-s] [-h] URI [URI …]
-
Displays sizes of files and directories contained in the given directory, or the length of a file in case it is just a file.
Options:
The -s option will result in an aggregate summary of file lengths being displayed, rather than the individual files.
The -h option will format file sizes in a "human-readable" fashion (e.g. 64.0m instead of 67108864).
Example:
hdfs dfs -du /user/hadoop/dir1 /user/hadoop/file1 hdfs://nn.example.com/user/hadoop/dir1
Exit Code:
Returns 0 on success and -1 on error.
- Usage: hdfs dfs -dus <args>
-
Displays a summary of file lengths. This is an alternate form of hdfs dfs -du -s.
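For example, the following two commands produce the same summary for the given path (the path is a placeholder):
hdfs dfs -dus /user/hadoop/dir1
hdfs dfs -du -s /user/hadoop/dir1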
-
- Usage: hdfs dfs -expunge
-
Empty the Trash. Refer to the HDFS Architecture Guide for more information on the Trash feature.
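The command takes no arguments; for example:
hdfs dfs -expunge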
-
- Usage: hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst>
-
-
Copy files to the local file system. Files that fail the CRC check may be copied with the -ignorecrc option. Files and CRCs may be copied using the -crc option.
-
Example:
- hdfs dfs -get /user/hadoop/file localfile
- hdfs dfs -get hdfs://nn.example.com/user/hadoop/file localfile
- Exit Code:
-
- Returns 0 on success and -1 on error.
-
- Usage: hdfs dfs -getmerge [-nl] <src> <localdst>
-
- Takes a source directory and a destination file as input and concatenates files in src into the destination local file.
Optionally, the -nl flag can be set to add a newline character at the end of each file during the merge.
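For example, to concatenate every file under a directory into a single local file with a newline after each (the destination file name is a placeholder):
hdfs dfs -getmerge -nl /user/hadoop/dir1 localmerged.txt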
-
- Usage: hdfs dfs -ls [-d] [-h] [-R] <args>
-
For a file, it returns stat on the file with the following format:
-
- permissions number_of_replicas userid groupid filesize modification_date modification_time filename
-
For a directory, it returns a list of its direct children, as in Unix. A directory is listed as:
-
- permissions userid groupid modification_date modification_time dirname
-
Options:
-d: Directories are listed as plain files.
-h: Format file sizes in a "human-readable" fashion (e.g. 64.0m instead of 67108864).
-R: Recursively list subdirectories encountered.
Example:
-
- hdfs dfs -ls /user/hadoop/file1
-
Exit Code:
-
- Returns 0 on success and -1 on error.
-
Usage: hdfs dfs -lsr <args>
Recursive version of ls. Similar to Unix ls -R.
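For example, to list a directory tree recursively (the path is a placeholder):
hdfs dfs -lsr /user/hadoop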
-
- Usage: hdfs dfs -mkdir <paths>
-
-
Takes path URIs as arguments and creates directories. The behavior is much like Unix mkdir -p, creating parent directories along the path.
-Example:
-hdfs dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
- hdfs dfs -mkdir hdfs://nn1.example.com/user/hadoop/dir hdfs://nn2.example.com/user/hadoop/dir
-
- Exit Code:
-
- Returns 0 on success and -1 on error.
-
Usage: hdfs dfs -moveFromLocal <localsrc> <dst>
-
Similar to the put command, except that the source localsrc is deleted after it is copied.
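For example, to copy a local file into HDFS and remove the local copy afterwards (the file names are placeholders):
hdfs dfs -moveFromLocal localfile /user/hadoop/hadoopfile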
- Usage: hdfs dfs -moveToLocal [-crc] <src> <dst>
-
Displays a "Not implemented yet" message.
-
- Usage: hdfs dfs -mv URI [URI …] <dest>
-
Moves files from source to destination. This command allows multiple sources as well, in which case the destination needs to be a directory.
- Moving files across file systems is not permitted.
-
- Example:
-
hdfs dfs -mv /user/hadoop/file1 /user/hadoop/file2
- hdfs dfs -mv hdfs://nn.example.com/file1 hdfs://nn.example.com/file2 hdfs://nn.example.com/file3 hdfs://nn.example.com/dir1
- Exit Code:
-
- Returns 0 on success and -1 on error.
-
- Usage: hdfs dfs -put <localsrc> ... <dst>
-
Copy single src, or multiple srcs from local file system to the destination file system.
- Also reads input from stdin and writes to destination file system.
-
hdfs dfs -put localfile /user/hadoop/hadoopfile
- hdfs dfs -put localfile1 localfile2 /user/hadoop/hadoopdir
- hdfs dfs -put localfile hdfs://nn.example.com/hadoop/hadoopfile
- hdfs dfs -put - hdfs://nn.example.com/hadoop/hadoopfile
Exit Code:
-
- Returns 0 on success and -1 on error.
-
- Usage: hdfs dfs -rm [-skipTrash] URI [URI …]
-
Delete files specified as args. Only deletes files. If the -skipTrash option is specified, the trash, if enabled, will be bypassed and the specified file(s) deleted immediately. This can be useful when it is necessary to delete files from an over-quota directory. Use -rm -r or rmr for recursive deletes.
- Example:
-
hdfs dfs -rm hdfs://nn.example.com/file
- Exit Code:
-
- Returns 0 on success and -1 on error.
-
- Usage: hdfs dfs -rmr [-skipTrash] URI [URI …]
-
Recursive version of delete. The rmr command recursively deletes the directory and any content under it. If the -skipTrash option is specified, the trash, if enabled, will be bypassed and the specified file(s) deleted immediately. This can be useful when it is necessary to delete files from an over-quota directory.
- Example:
-
hdfs dfs -rmr /user/hadoop/dir
- hdfs dfs -rmr hdfs://nn.example.com/user/hadoop/dir
- Exit Code:
-
- Returns 0 on success and -1 on error.
-
Usage: hdfs dfs -setrep [-R] [-w] <rep> <path>
-
Changes the replication factor of a file. The -R option is for recursively changing the replication factor of files within a directory.
-Example:
- hdfs dfs -setrep -w 3 -R /user/hadoop/dir1
- Exit Code:
-
- Returns 0 on success and -1 on error.
-
- Usage: hdfs dfs -stat [format] URI [URI …]
-
Print statistics about the file/directory matching the given URI pattern in the specified format.
Format accepts printf-style specifiers such as %b, %r and %y, as illustrated in the examples below.
-Example:
- hdfs dfs -stat path
- hdfs dfs -stat %y path
- hdfs dfs -stat '%b %r' path
- Exit Code:
- Returns 0 on success and -1 on error.
- Usage: hdfs dfs -tail [-f] URI
-
Displays the last kilobyte of the file to stdout. The -f option can be used as in Unix.
-Example:
- hdfs dfs -tail pathname
- Exit Code:
- Returns 0 on success and -1 on error.
- Usage: hdfs dfs -test -[ezd] URI
-
- Options:
- -e check to see if the file exists. Return 0 if true.
- -z check to see if the file is zero length. Return 0 if true.
- -d check to see if the path is directory. Return 0 if true.
Example:
- hdfs dfs -test -e filename
-
- Usage: hdfs dfs -text <src>
-
-
- Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream. -
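For example, assuming /user/hadoop/data.seq (a placeholder path) points to a file in one of the allowed formats:
hdfs dfs -text /user/hadoop/data.seq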
-
- Usage: hdfs dfs -touchz URI [URI …]
-
-
- Create a file of zero length. -
-Example:
hdfs dfs -touchz pathname
- Exit Code:
- Returns 0 on success and -1 on error.