HADOOP-11495. Backport "convert site documentation from apt to markdown" to branch-2 (Masatake Iwasaki via Colin P. McCabe)

(cherry-picked from commit b6fc1f3e43)

Conflicts:
    hadoop-common-project/hadoop-common/src/site/apt/ClusterSetup.apt.vm
    hadoop-common-project/hadoop-common/src/site/apt/CommandsManual.apt.vm
    hadoop-common-project/hadoop-common/src/site/apt/FileSystemShell.apt.vm
    hadoop-common-project/hadoop-common/src/site/apt/RackAwareness.apt.vm
    hadoop-common-project/hadoop-common/src/site/apt/SingleCluster.apt.vm
    hadoop-common-project/hadoop-common/src/site/apt/SecureMode.apt.vm
    hadoop-common-project/hadoop-common/src/site/apt/Tracing.apt.vm
    hadoop-project/src/site/site.xml
Colin Patrick Mccabe 2015-02-24 15:48:58 -08:00
parent efb7e287f4
commit 343cffb0ea
31 changed files with 3339 additions and 4890 deletions


@@ -212,6 +212,9 @@ Release 2.7.0 - UNRELEASED
HADOOP-11607. Reduce log spew in S3AFileSystem. (Lei (Eddy) Xu via stevel)
HADOOP-11495. Convert site documentation from apt to markdown (Masatake
Iwasaki via aw)
OPTIMIZATIONS
HADOOP-11323. WritableComparator#compare keeps reference to byte array.


@@ -1,83 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop MapReduce Next Generation ${project.version} - CLI MiniCluster.
---
---
${maven.build.timestamp}
Hadoop MapReduce Next Generation - CLI MiniCluster.
%{toc|section=1|fromDepth=0}
* {Purpose}
Using the CLI MiniCluster, users can simply start and stop a single-node
Hadoop cluster with a single command, and without the need to set any
environment variables or manage configuration files. The CLI MiniCluster
starts both a <<<YARN>>>/<<<MapReduce>>> and an <<<HDFS>>> cluster.
This is useful for cases where users want to quickly experiment with a real
Hadoop cluster or test non-Java programs that rely on significant Hadoop
functionality.
* {Hadoop Tarball}
You should be able to obtain the Hadoop tarball from the release. Also, you
can directly create a tarball from the source:
+---+
$ mvn clean install -DskipTests
$ mvn package -Pdist -Dtar -DskipTests -Dmaven.javadoc.skip
+---+
<<NOTE:>> You will need {{{http://code.google.com/p/protobuf/}protoc 2.5.0}}
installed.
The tarball should be available in the <<<hadoop-dist/target/>>> directory.
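For example (illustrative only; the exact tarball name depends on the version
being built), the tarball can be extracted with:
+---+
$ cd hadoop-dist/target/
$ tar -xzf hadoop-${project.version}.tar.gz
$ cd hadoop-${project.version}
+---+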
* {Running the MiniCluster}
From inside the root directory of the extracted tarball, you can start the CLI
MiniCluster using the following command:
+---+
$ bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-${project.version}-tests.jar minicluster -rmport RM_PORT -jhsport JHS_PORT
+---+
In the example command above, <<<RM_PORT>>> and <<<JHS_PORT>>> should be
replaced by the user's choice of these port numbers. If not specified, random
free ports will be used.
There are a number of command line arguments that users can use to control
which services to start and to pass other configuration properties.
The available command line arguments are:
+---+
$ -D <property=value> Options to pass into configuration object
$ -datanodes <arg> How many datanodes to start (default 1)
$ -format Format the DFS (default false)
$ -help Prints option help.
$ -jhsport <arg> JobHistoryServer port (default 0--we choose)
$ -namenode <arg> URL of the namenode (default is either the DFS
$ cluster or a temporary dir)
$ -nnport <arg> NameNode port (default 0--we choose)
$ -nodemanagers <arg> How many nodemanagers to start (default 1)
$ -nodfs Don't start a mini DFS cluster
$ -nomr Don't start a mini MR cluster
$ -rmport <arg> ResourceManager port (default 0--we choose)
$ -writeConfig <path> Save configuration to this XML file.
$ -writeDetails <path> Write basic information to this JSON file.
+---+
To display this full list of available arguments, the user can pass the
<<<-help>>> argument to the above command.
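As an illustrative example (the option values below are arbitrary), the
following command starts a MiniCluster with two datanodes, no MR cluster,
a fixed NameNode port, and writes the generated configuration to a file:
+---+
$ bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-${project.version}-tests.jar minicluster -datanodes 2 -nomr -nnport 8020 -writeConfig /tmp/minicluster-site.xml
+---+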


@@ -1,541 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Apache Hadoop Compatibility
---
---
${maven.build.timestamp}
Apache Hadoop Compatibility
%{toc|section=1|fromDepth=0}
* Purpose
This document captures the compatibility goals of the Apache Hadoop
project. The different types of compatibility between Hadoop
releases that affect Hadoop developers, downstream projects, and
end-users are enumerated. For each type of compatibility we:
* describe the impact on downstream projects or end-users
* where applicable, call out the policy adopted by the Hadoop
developers when incompatible changes are permitted.
* Compatibility types
** Java API
Hadoop interfaces and classes are annotated to describe the intended
audience and stability in order to maintain compatibility with previous
releases. See {{{./InterfaceClassification.html}Hadoop Interface
Classification}}
for details.
* InterfaceAudience: captures the intended audience, possible
values are Public (for end users and external projects),
LimitedPrivate (for other Hadoop components, and closely related
projects like YARN, MapReduce, HBase etc.), and Private (for intra component
use).
* InterfaceStability: describes what types of interface changes are
permitted. Possible values are Stable, Evolving, Unstable, and Deprecated.
*** Use Cases
* Public-Stable API compatibility is required to ensure end-user programs
and downstream projects continue to work without modification.
* LimitedPrivate-Stable API compatibility is required to allow upgrade of
individual components across minor releases.
* Private-Stable API compatibility is required for rolling upgrades.
*** Policy
* Public-Stable APIs must be deprecated for at least one major release
prior to their removal in a major release.
* LimitedPrivate-Stable APIs can change across major releases,
but not within a major release.
* Private-Stable APIs can change across major releases,
but not within a major release.
* Classes not annotated are implicitly "Private". Class members not
annotated inherit the annotations of the enclosing class.
* Note: APIs generated from the proto files need to be compatible for
rolling-upgrades. See the section on wire-compatibility for more details.
The compatibility policies for APIs and wire-communication need to go
hand-in-hand to address this.
** Semantic compatibility
Apache Hadoop strives to ensure that the behavior of APIs remains
consistent over versions, though changes for correctness may result in
changes in behavior. Tests and javadocs specify the API's behavior.
The community is in the process of specifying some APIs more rigorously,
and enhancing test suites to verify compliance with the specification,
effectively creating a formal specification for the subset of behaviors
that can be easily tested.
*** Policy
The behavior of an API may be changed to fix incorrect behavior;
such a change must be accompanied by updating existing buggy tests or adding
tests in cases where there were none prior to the change.
** Wire compatibility
Wire compatibility concerns data being transmitted over the wire
between Hadoop processes. Hadoop uses Protocol Buffers for most RPC
communication. Preserving compatibility requires prohibiting
modification as described below.
Non-RPC communication should be considered as well,
for example using HTTP to transfer an HDFS image as part of
snapshotting or transferring MapTask output. The potential
communications can be categorized as follows:
* Client-Server: communication between Hadoop clients and servers (e.g.,
the HDFS client to NameNode protocol, or the YARN client to
ResourceManager protocol).
* Client-Server (Admin): It is worth distinguishing a subset of the
Client-Server protocols used solely by administrative commands (e.g.,
the HAAdmin protocol) as these protocols only impact administrators
who can tolerate changes that end users (who use general
Client-Server protocols) cannot.
* Server-Server: communication between servers (e.g., the protocol between
the DataNode and NameNode, or NodeManager and ResourceManager)
*** Use Cases
* Client-Server compatibility is required to allow users to
continue using the old clients even after upgrading the server
(cluster) to a later version (or vice versa). For example, a
Hadoop 2.1.0 client talking to a Hadoop 2.3.0 cluster.
* Client-Server compatibility is also required to allow users to upgrade the
client before upgrading the server (cluster). For example, a Hadoop 2.4.0
client talking to a Hadoop 2.3.0 cluster. This allows deployment of
client-side bug fixes ahead of full cluster upgrades. Note that new cluster
features invoked by new client APIs or shell commands will not be usable.
YARN applications that attempt to use new APIs (including new fields in data
structures) that have not yet been deployed to the cluster can expect link
exceptions.
* Client-Server compatibility is also required to allow upgrading
individual components without upgrading others. For example,
upgrade HDFS from version 2.1.0 to 2.2.0 without upgrading MapReduce.
* Server-Server compatibility is required to allow mixed versions
within an active cluster so the cluster may be upgraded without
downtime in a rolling fashion.
*** Policy
* Both Client-Server and Server-Server compatibility are preserved within a
major release. (Different policies for different categories are yet to be
considered.)
* Compatibility can be broken only at a major release, though breaking compatibility
even at major releases has grave consequences and should be discussed in the Hadoop community.
* Hadoop protocols are defined in .proto (ProtocolBuffers) files.
Client-Server protocol and Server-Server protocol .proto files are marked as stable.
When a .proto file is marked as stable it means that changes should be made
in a compatible fashion as described below:
* The following changes are compatible and are allowed at any time:
* Add an optional field, with the expectation that the code deals with the field missing due to communication with an older version of the code.
* Add a new rpc/method to the service
* Add a new optional request to a Message
* Rename a field
* Rename a .proto file
* Change .proto annotations that affect code generation (e.g. name of java package)
* The following changes are incompatible but can be considered only at a major release
* Change the rpc/method name
* Change the rpc/method parameter type or return type
* Remove an rpc/method
* Change the service name
* Change the name of a Message
* Modify a field type in an incompatible way (as defined recursively)
* Change an optional field to required
* Add or delete a required field
* Delete an optional field as long as the optional field has reasonable defaults to allow deletions
* The following changes are incompatible and hence never allowed
* Change a field id
* Reuse an old field that was previously deleted.
* Field numbers are cheap; changing and reusing them is not a good idea.
** Java Binary compatibility for end-user applications i.e. Apache Hadoop ABI
As Apache Hadoop revisions are upgraded, end-users reasonably expect that
their applications should continue to work without any modifications.
This is fulfilled as a result of supporting API compatibility, semantic
compatibility and wire compatibility.
However, Apache Hadoop is a very complex, distributed system and services a
very wide variety of use-cases. In particular, Apache Hadoop MapReduce exposes a
very wide API, in the sense that end-users may make wide-ranging
assumptions such as the layout of the local disk when their map/reduce tasks are
executing, environment variables for their tasks, etc. In such cases, it
becomes very hard to fully specify, and support, absolute compatibility.
*** Use cases
* Existing MapReduce applications, including jars of existing packaged
end-user applications and projects such as Apache Pig, Apache Hive,
Cascading etc. should work unmodified when pointed to an upgraded Apache
Hadoop cluster within a major release.
* Existing YARN applications, including jars of existing packaged
end-user applications and projects such as Apache Tez etc. should work
unmodified when pointed to an upgraded Apache Hadoop cluster within a
major release.
* Existing applications which transfer data in/out of HDFS, including jars
of existing packaged end-user applications and frameworks such as Apache
Flume, should work unmodified when pointed to an upgraded Apache Hadoop
cluster within a major release.
*** Policy
* Existing MapReduce, YARN & HDFS applications and frameworks should work
unmodified within a major release i.e. Apache Hadoop ABI is supported.
* A very minor fraction of applications may be affected by changes to disk
layouts etc.; the developer community will strive to minimize these
changes and will not make them within a minor version. In more egregious
cases, we will strongly consider reverting these breaking changes and
invalidating offending releases if necessary.
* In particular for MapReduce applications, the developer community will
try its best to provide binary compatibility across major
releases e.g. applications using org.apache.hadoop.mapred.
* APIs are supported compatibly across hadoop-1.x and hadoop-2.x. See
{{{../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html}
Compatibility for MapReduce applications between hadoop-1.x and hadoop-2.x}}
for more details.
** REST APIs
REST API compatibility corresponds to both the request (URLs) and responses
to each request (content, which may contain other URLs). Hadoop REST APIs
are specifically meant for stable use by clients across releases,
even major releases. The following are the exposed REST APIs:
* {{{../hadoop-hdfs/WebHDFS.html}WebHDFS}} - Stable
* {{{../../hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html}ResourceManager}}
* {{{../../hadoop-yarn/hadoop-yarn-site/NodeManagerRest.html}NodeManager}}
* {{{../../hadoop-yarn/hadoop-yarn-site/MapredAppMasterRest.html}MR Application Master}}
* {{{../../hadoop-yarn/hadoop-yarn-site/HistoryServerRest.html}History Server}}
*** Policy
The APIs annotated stable in the text above preserve compatibility
across at least one major release, and may be deprecated by a newer
version of the REST API in a major release.
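As an illustrative sketch only, a stable REST API such as WebHDFS can be
exercised with any HTTP client; the NameNode host below is a placeholder and
50070 is the default NameNode HTTP port:
+---+
$ curl -i "http://<namenode-host>:50070/webhdfs/v1/user/hadoop?op=LISTSTATUS"
+---+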
** Metrics/JMX
While the Metrics API compatibility is governed by Java API compatibility,
the actual metrics exposed by Hadoop need to be compatible for users to
be able to automate using them (scripts etc.). Adding additional metrics
is compatible. Modifying (e.g. changing the unit or measurement) or removing
existing metrics breaks compatibility. Similarly, changes to JMX MBean
object names also break compatibility.
*** Policy
Metrics should preserve compatibility within the major release.
** File formats & Metadata
User and system level data (including metadata) is stored in files of
different formats. Changes to the metadata or the file formats used to
store data/metadata can lead to incompatibilities between versions.
*** User-level file formats
Changes to formats that end-users use to store their data can prevent
them from accessing the data in later releases, and hence it is highly
important to keep those file-formats compatible. One can always add a
"new" format improving upon an existing format. Examples of these formats
include har, war, SequenceFileFormat etc.
**** Policy
* Non-forward-compatible user-file format changes are
restricted to major releases. When user-file formats change, new
releases are expected to read existing formats, but may write data
in formats incompatible with prior releases. Also, the community
shall prefer to create a new format that programs must opt in to
instead of making incompatible changes to existing formats.
*** System-internal file formats
Hadoop internal data is also stored in files and again changing these
formats can lead to incompatibilities. While such changes are not as
devastating as changes to user-level file formats, a policy on when the
compatibility can be broken is important.
**** MapReduce
MapReduce uses formats like I-File to store MapReduce-specific data.
***** Policy
MapReduce-internal formats like IFile maintain compatibility within a
major release. Changes to these formats can cause in-flight jobs to fail
and hence we should ensure newer clients can fetch shuffle-data from old
servers in a compatible manner.
**** HDFS Metadata
HDFS persists metadata (the image and edit logs) in a particular format.
Incompatible changes to either the format or the metadata prevent
subsequent releases from reading older metadata. Such incompatible
changes might require an HDFS "upgrade" to convert the metadata to make
it accessible. Some changes can require more than one such "upgrade".
Depending on the degree of incompatibility in the changes, the following
potential scenarios can arise:
* Automatic: The image upgrades automatically, no need for an explicit
"upgrade".
* Direct: The image is upgradable, but might require one explicit release
"upgrade".
* Indirect: The image is upgradable, but might require upgrading to
intermediate release(s) first.
* Not upgradeable: The image is not upgradeable.
***** Policy
* A release upgrade must allow a cluster to roll back to the older
version and its older disk format. The rollback needs to restore the
original data, but is not required to restore the updated data.
* HDFS metadata changes must be upgradeable via any of the upgrade
paths - automatic, direct or indirect.
* More detailed policies based on the kind of upgrade are yet to be
considered.
** Command Line Interface (CLI)
The Hadoop command line programs may be used either directly via the
system shell or via shell scripts. Changing the path of a command,
removing or renaming command line options, changing the order of arguments,
or changing the command return code or output breaks compatibility and
may adversely affect users.
*** Policy
CLI commands are to be deprecated (warning when used) for one
major release before they are removed or incompatibly modified in
a subsequent major release.
** Web UI
Web UI, particularly the content and layout of web pages, changes
could potentially interfere with attempts to screen scrape the web
pages for information.
*** Policy
Web pages are not meant to be scraped and hence incompatible
changes to them are allowed at any time. Users are expected to use
REST APIs to get any information.
** Hadoop Configuration Files
Users use (1) Hadoop-defined properties to configure and provide hints to
Hadoop and (2) custom properties to pass information to jobs. Hence,
compatibility of config properties is two-fold:
* Modifying key-names, units of values, and default values of Hadoop-defined
properties.
* Custom configuration property keys should not conflict with the
namespace of Hadoop-defined properties. Typically, users should
avoid using prefixes used by Hadoop: hadoop, io, ipc, fs, net,
file, ftp, s3, kfs, ha, file, dfs, mapred, mapreduce, yarn.
*** Policy
* Hadoop-defined properties are to be deprecated at least for one
major release before being removed. Modifying units for existing
properties is not allowed.
* The default values of Hadoop-defined properties can
be changed across minor/major releases, but will remain the same
across point releases within a minor release.
* Currently, there is NO explicit policy regarding when new
prefixes can be added/removed, and the list of prefixes to be
avoided for custom configuration properties. However, as noted above,
users should avoid using prefixes used by Hadoop: hadoop, io, ipc, fs,
net, file, ftp, s3, kfs, ha, file, dfs, mapred, mapreduce, yarn.
** Directory Structure
Source code, artifacts (source and tests), user logs, configuration files,
output and job history are all stored on disk, either on the local file
system or in HDFS. Changing the directory structure of these user-accessible
files breaks compatibility, even in cases where the original path is
preserved via symbolic links (if, for example, the path is accessed
by a servlet that is configured not to follow symbolic links).
*** Policy
* The layout of source code and build artifacts can change
anytime, particularly so across major versions. Within a major
version, the developers will attempt (no guarantees) to preserve
the directory structure; however, individual files can be
added/moved/deleted. The best way to ensure patches stay in sync
with the code is to get them committed to the Apache source tree.
* The directory structure of configuration files, user logs, and
job history will be preserved across minor and point releases
within a major release.
** Java Classpath
User applications built against Hadoop might add all Hadoop jars
(including Hadoop's library dependencies) to the application's
classpath. Adding new dependencies or updating the version of
existing dependencies may interfere with those in applications'
classpaths.
*** Policy
Currently, there is NO policy on when Hadoop's dependencies can
change.
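As a practical (illustrative) check, the jars and directories Hadoop itself
places on the classpath can be listed with:
+---+
$ bin/hadoop classpath
+---+
Comparing this output against an application's own dependencies can help spot
conflicts, but it is not a compatibility guarantee.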
** Environment variables
Users and related projects often utilize the exported environment
variables (e.g. HADOOP_CONF_DIR); therefore, removing or renaming
environment variables is an incompatible change.
*** Policy
Currently, there is NO policy on when the environment variables
can change. Developers try to limit changes to major releases.
** Build artifacts
Hadoop uses Maven for project management, and changing the artifacts
can affect existing user workflows.
*** Policy
* Test artifacts: The test jars generated are strictly for internal
use and are not expected to be used outside of Hadoop, similar to
APIs annotated @Private, @Unstable.
* Built artifacts: The hadoop-client artifact (maven
groupId:artifactId) stays compatible within a major release,
while the other artifacts can change in incompatible ways.
** Hardware/Software Requirements
To keep up with the latest advances in hardware, operating systems,
JVMs, and other software, new Hadoop releases or some of their
features might require higher versions of the same. For a specific
environment, upgrading Hadoop might require upgrading other
dependent software components.
*** Policies
* Hardware
* Architecture: The community has no plans to restrict Hadoop to
specific architectures, but can have family-specific
optimizations.
* Minimum resources: While there are no guarantees on the
minimum resources required by Hadoop daemons, the community
attempts to not increase requirements within a minor release.
* Operating Systems: The community will attempt to maintain the
same OS requirements (OS kernel versions) within a minor
release. Currently, GNU/Linux and Microsoft Windows are the OSes officially
supported by the community, while Apache Hadoop is known to work reasonably
well on other OSes such as Apple MacOSX, Solaris, etc.
* The JVM requirements will not change across point releases
within the same minor release except if the JVM version under
question becomes unsupported. Minor/major releases might require
later versions of JVM for some/all of the supported operating
systems.
* Other software: The community tries to maintain the minimum
versions of additional software required by Hadoop. For example,
ssh, kerberos etc.
* References
Here are some relevant JIRAs and pages related to the topic:
* The evolution of this document -
{{{https://issues.apache.org/jira/browse/HADOOP-9517}HADOOP-9517}}
* Binary compatibility for MapReduce end-user applications between hadoop-1.x and hadoop-2.x -
{{{../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html}
MapReduce Compatibility between hadoop-1.x and hadoop-2.x}}
* Annotations for interfaces as per interface classification
schedule -
{{{https://issues.apache.org/jira/browse/HADOOP-7391}HADOOP-7391}}
{{{./InterfaceClassification.html}Hadoop Interface Classification}}
* Compatibility for Hadoop 1.x releases -
{{{https://issues.apache.org/jira/browse/HADOOP-5071}HADOOP-5071}}
* The {{{http://wiki.apache.org/hadoop/Roadmap}Hadoop Roadmap}} page
that captures other release policies


@@ -1,552 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop ${project.version}
---
---
${maven.build.timestamp}
Deprecated Properties
The following table lists the configuration property names that are
deprecated in this version of Hadoop, and their replacements.
*-------------------------------+-----------------------+
|| <<Deprecated property name>> || <<New property name>>|
*-------------------------------+-----------------------+
|create.empty.dir.if.nonexist | mapreduce.jobcontrol.createdir.ifnotexist
*---+---+
|dfs.access.time.precision | dfs.namenode.accesstime.precision
*---+---+
|dfs.backup.address | dfs.namenode.backup.address
*---+---+
|dfs.backup.http.address | dfs.namenode.backup.http-address
*---+---+
|dfs.balance.bandwidthPerSec | dfs.datanode.balance.bandwidthPerSec
*---+---+
|dfs.block.size | dfs.blocksize
*---+---+
|dfs.data.dir | dfs.datanode.data.dir
*---+---+
|dfs.datanode.max.xcievers | dfs.datanode.max.transfer.threads
*---+---+
|dfs.df.interval | fs.df.interval
*---+---+
|dfs.federation.nameservice.id | dfs.nameservice.id
*---+---+
|dfs.federation.nameservices | dfs.nameservices
*---+---+
|dfs.http.address | dfs.namenode.http-address
*---+---+
|dfs.https.address | dfs.namenode.https-address
*---+---+
|dfs.https.client.keystore.resource | dfs.client.https.keystore.resource
*---+---+
|dfs.https.need.client.auth | dfs.client.https.need-auth
*---+---+
|dfs.max.objects | dfs.namenode.max.objects
*---+---+
|dfs.max-repl-streams | dfs.namenode.replication.max-streams
*---+---+
|dfs.name.dir | dfs.namenode.name.dir
*---+---+
|dfs.name.dir.restore | dfs.namenode.name.dir.restore
*---+---+
|dfs.name.edits.dir | dfs.namenode.edits.dir
*---+---+
|dfs.permissions | dfs.permissions.enabled
*---+---+
|dfs.permissions.supergroup | dfs.permissions.superusergroup
*---+---+
|dfs.read.prefetch.size | dfs.client.read.prefetch.size
*---+---+
|dfs.replication.considerLoad | dfs.namenode.replication.considerLoad
*---+---+
|dfs.replication.interval | dfs.namenode.replication.interval
*---+---+
|dfs.replication.min | dfs.namenode.replication.min
*---+---+
|dfs.replication.pending.timeout.sec | dfs.namenode.replication.pending.timeout-sec
*---+---+
|dfs.safemode.extension | dfs.namenode.safemode.extension
*---+---+
|dfs.safemode.threshold.pct | dfs.namenode.safemode.threshold-pct
*---+---+
|dfs.secondary.http.address | dfs.namenode.secondary.http-address
*---+---+
|dfs.socket.timeout | dfs.client.socket-timeout
*---+---+
|dfs.umaskmode | fs.permissions.umask-mode
*---+---+
|dfs.write.packet.size | dfs.client-write-packet-size
*---+---+
|fs.checkpoint.dir | dfs.namenode.checkpoint.dir
*---+---+
|fs.checkpoint.edits.dir | dfs.namenode.checkpoint.edits.dir
*---+---+
|fs.checkpoint.period | dfs.namenode.checkpoint.period
*---+---+
|fs.default.name | fs.defaultFS
*---+---+
|hadoop.configured.node.mapping | net.topology.configured.node.mapping
*---+---+
|hadoop.job.history.location | mapreduce.jobtracker.jobhistory.location
*---+---+
|hadoop.native.lib | io.native.lib.available
*---+---+
|hadoop.net.static.resolutions | mapreduce.tasktracker.net.static.resolutions
*---+---+
|hadoop.pipes.command-file.keep | mapreduce.pipes.commandfile.preserve
*---+---+
|hadoop.pipes.executable.interpretor | mapreduce.pipes.executable.interpretor
*---+---+
|hadoop.pipes.executable | mapreduce.pipes.executable
*---+---+
|hadoop.pipes.java.mapper | mapreduce.pipes.isjavamapper
*---+---+
|hadoop.pipes.java.recordreader | mapreduce.pipes.isjavarecordreader
*---+---+
|hadoop.pipes.java.recordwriter | mapreduce.pipes.isjavarecordwriter
*---+---+
|hadoop.pipes.java.reducer | mapreduce.pipes.isjavareducer
*---+---+
|hadoop.pipes.partitioner | mapreduce.pipes.partitioner
*---+---+
|heartbeat.recheck.interval | dfs.namenode.heartbeat.recheck-interval
*---+---+
|io.bytes.per.checksum | dfs.bytes-per-checksum
*---+---+
|io.sort.factor | mapreduce.task.io.sort.factor
*---+---+
|io.sort.mb | mapreduce.task.io.sort.mb
*---+---+
|io.sort.spill.percent | mapreduce.map.sort.spill.percent
*---+---+
|jobclient.completion.poll.interval | mapreduce.client.completion.pollinterval
*---+---+
|jobclient.output.filter | mapreduce.client.output.filter
*---+---+
|jobclient.progress.monitor.poll.interval | mapreduce.client.progressmonitor.pollinterval
*---+---+
|job.end.notification.url | mapreduce.job.end-notification.url
*---+---+
|job.end.retry.attempts | mapreduce.job.end-notification.retry.attempts
*---+---+
|job.end.retry.interval | mapreduce.job.end-notification.retry.interval
*---+---+
|job.local.dir | mapreduce.job.local.dir
*---+---+
|keep.failed.task.files | mapreduce.task.files.preserve.failedtasks
*---+---+
|keep.task.files.pattern | mapreduce.task.files.preserve.filepattern
*---+---+
|key.value.separator.in.input.line | mapreduce.input.keyvaluelinerecordreader.key.value.separator
*---+---+
|local.cache.size | mapreduce.tasktracker.cache.local.size
*---+---+
|map.input.file | mapreduce.map.input.file
*---+---+
|map.input.length | mapreduce.map.input.length
*---+---+
|map.input.start | mapreduce.map.input.start
*---+---+
|map.output.key.field.separator | mapreduce.map.output.key.field.separator
*---+---+
|map.output.key.value.fields.spec | mapreduce.fieldsel.map.output.key.value.fields.spec
*---+---+
|mapred.acls.enabled | mapreduce.cluster.acls.enabled
*---+---+
|mapred.binary.partitioner.left.offset | mapreduce.partition.binarypartitioner.left.offset
*---+---+
|mapred.binary.partitioner.right.offset | mapreduce.partition.binarypartitioner.right.offset
*---+---+
|mapred.cache.archives | mapreduce.job.cache.archives
*---+---+
|mapred.cache.archives.timestamps | mapreduce.job.cache.archives.timestamps
*---+---+
|mapred.cache.files | mapreduce.job.cache.files
*---+---+
|mapred.cache.files.timestamps | mapreduce.job.cache.files.timestamps
*---+---+
|mapred.cache.localArchives | mapreduce.job.cache.local.archives
*---+---+
|mapred.cache.localFiles | mapreduce.job.cache.local.files
*---+---+
|mapred.child.tmp | mapreduce.task.tmp.dir
*---+---+
|mapred.cluster.average.blacklist.threshold | mapreduce.jobtracker.blacklist.average.threshold
*---+---+
|mapred.cluster.map.memory.mb | mapreduce.cluster.mapmemory.mb
*---+---+
|mapred.cluster.max.map.memory.mb | mapreduce.jobtracker.maxmapmemory.mb
*---+---+
|mapred.cluster.max.reduce.memory.mb | mapreduce.jobtracker.maxreducememory.mb
*---+---+
|mapred.cluster.reduce.memory.mb | mapreduce.cluster.reducememory.mb
*---+---+
|mapred.committer.job.setup.cleanup.needed | mapreduce.job.committer.setup.cleanup.needed
*---+---+
|mapred.compress.map.output | mapreduce.map.output.compress
*---+---+
|mapred.data.field.separator | mapreduce.fieldsel.data.field.separator
*---+---+
|mapred.debug.out.lines | mapreduce.task.debugout.lines
*---+---+
|mapred.healthChecker.interval | mapreduce.tasktracker.healthchecker.interval
*---+---+
|mapred.healthChecker.script.args | mapreduce.tasktracker.healthchecker.script.args
*---+---+
|mapred.healthChecker.script.path | mapreduce.tasktracker.healthchecker.script.path
*---+---+
|mapred.healthChecker.script.timeout | mapreduce.tasktracker.healthchecker.script.timeout
*---+---+
|mapred.heartbeats.in.second | mapreduce.jobtracker.heartbeats.in.second
*---+---+
|mapred.hosts.exclude | mapreduce.jobtracker.hosts.exclude.filename
*---+---+
|mapred.hosts | mapreduce.jobtracker.hosts.filename
*---+---+
|mapred.inmem.merge.threshold | mapreduce.reduce.merge.inmem.threshold
*---+---+
|mapred.input.dir.formats | mapreduce.input.multipleinputs.dir.formats
*---+---+
|mapred.input.dir.mappers | mapreduce.input.multipleinputs.dir.mappers
*---+---+
|mapred.input.dir | mapreduce.input.fileinputformat.inputdir
*---+---+
|mapred.input.pathFilter.class | mapreduce.input.pathFilter.class
*---+---+
|mapred.jar | mapreduce.job.jar
*---+---+
|mapred.job.classpath.archives | mapreduce.job.classpath.archives
*---+---+
|mapred.job.classpath.files | mapreduce.job.classpath.files
*---+---+
|mapred.job.id | mapreduce.job.id
*---+---+
|mapred.jobinit.threads | mapreduce.jobtracker.jobinit.threads
*---+---+
|mapred.job.map.memory.mb | mapreduce.map.memory.mb
*---+---+
|mapred.job.name | mapreduce.job.name
*---+---+
|mapred.job.priority | mapreduce.job.priority
*---+---+
|mapred.job.queue.name | mapreduce.job.queuename
*---+---+
|mapred.job.reduce.input.buffer.percent | mapreduce.reduce.input.buffer.percent
*---+---+
|mapred.job.reduce.markreset.buffer.percent | mapreduce.reduce.markreset.buffer.percent
*---+---+
|mapred.job.reduce.memory.mb | mapreduce.reduce.memory.mb
*---+---+
|mapred.job.reduce.total.mem.bytes | mapreduce.reduce.memory.totalbytes
*---+---+
|mapred.job.reuse.jvm.num.tasks | mapreduce.job.jvm.numtasks
*---+---+
|mapred.job.shuffle.input.buffer.percent | mapreduce.reduce.shuffle.input.buffer.percent
*---+---+
|mapred.job.shuffle.merge.percent | mapreduce.reduce.shuffle.merge.percent
*---+---+
|mapred.job.tracker.handler.count | mapreduce.jobtracker.handler.count
*---+---+
|mapred.job.tracker.history.completed.location | mapreduce.jobtracker.jobhistory.completed.location
*---+---+
|mapred.job.tracker.http.address | mapreduce.jobtracker.http.address
*---+---+
|mapred.jobtracker.instrumentation | mapreduce.jobtracker.instrumentation
*---+---+
|mapred.jobtracker.job.history.block.size | mapreduce.jobtracker.jobhistory.block.size
*---+---+
|mapred.job.tracker.jobhistory.lru.cache.size | mapreduce.jobtracker.jobhistory.lru.cache.size
*---+---+
|mapred.job.tracker | mapreduce.jobtracker.address
*---+---+
|mapred.jobtracker.maxtasks.per.job | mapreduce.jobtracker.maxtasks.perjob
*---+---+
|mapred.job.tracker.persist.jobstatus.active | mapreduce.jobtracker.persist.jobstatus.active
*---+---+
|mapred.job.tracker.persist.jobstatus.dir | mapreduce.jobtracker.persist.jobstatus.dir
*---+---+
|mapred.job.tracker.persist.jobstatus.hours | mapreduce.jobtracker.persist.jobstatus.hours
*---+---+
|mapred.jobtracker.restart.recover | mapreduce.jobtracker.restart.recover
*---+---+
|mapred.job.tracker.retiredjobs.cache.size | mapreduce.jobtracker.retiredjobs.cache.size
*---+---+
|mapred.job.tracker.retire.jobs | mapreduce.jobtracker.retirejobs
*---+---+
|mapred.jobtracker.taskalloc.capacitypad | mapreduce.jobtracker.taskscheduler.taskalloc.capacitypad
*---+---+
|mapred.jobtracker.taskScheduler | mapreduce.jobtracker.taskscheduler
*---+---+
|mapred.jobtracker.taskScheduler.maxRunningTasksPerJob | mapreduce.jobtracker.taskscheduler.maxrunningtasks.perjob
*---+---+
|mapred.join.expr | mapreduce.join.expr
*---+---+
|mapred.join.keycomparator | mapreduce.join.keycomparator
*---+---+
|mapred.lazy.output.format | mapreduce.output.lazyoutputformat.outputformat
*---+---+
|mapred.line.input.format.linespermap | mapreduce.input.lineinputformat.linespermap
*---+---+
|mapred.linerecordreader.maxlength | mapreduce.input.linerecordreader.line.maxlength
*---+---+
|mapred.local.dir | mapreduce.cluster.local.dir
*---+---+
|mapred.local.dir.minspacekill | mapreduce.tasktracker.local.dir.minspacekill
*---+---+
|mapred.local.dir.minspacestart | mapreduce.tasktracker.local.dir.minspacestart
*---+---+
|mapred.map.child.env | mapreduce.map.env
*---+---+
|mapred.map.child.java.opts | mapreduce.map.java.opts
*---+---+
|mapred.map.child.log.level | mapreduce.map.log.level
*---+---+
|mapred.map.max.attempts | mapreduce.map.maxattempts
*---+---+
|mapred.map.output.compression.codec | mapreduce.map.output.compress.codec
*---+---+
|mapred.mapoutput.key.class | mapreduce.map.output.key.class
*---+---+
|mapred.mapoutput.value.class | mapreduce.map.output.value.class
*---+---+
|mapred.mapper.regex.group | mapreduce.mapper.regexmapper..group
*---+---+
|mapred.mapper.regex | mapreduce.mapper.regex
*---+---+
|mapred.map.task.debug.script | mapreduce.map.debug.script
*---+---+
|mapred.map.tasks | mapreduce.job.maps
*---+---+
|mapred.map.tasks.speculative.execution | mapreduce.map.speculative
*---+---+
|mapred.max.map.failures.percent | mapreduce.map.failures.maxpercent
*---+---+
|mapred.max.reduce.failures.percent | mapreduce.reduce.failures.maxpercent
*---+---+
|mapred.max.split.size | mapreduce.input.fileinputformat.split.maxsize
*---+---+
|mapred.max.tracker.blacklists | mapreduce.jobtracker.tasktracker.maxblacklists
*---+---+
|mapred.max.tracker.failures | mapreduce.job.maxtaskfailures.per.tracker
*---+---+
|mapred.merge.recordsBeforeProgress | mapreduce.task.merge.progress.records
*---+---+
|mapred.min.split.size | mapreduce.input.fileinputformat.split.minsize
*---+---+
|mapred.min.split.size.per.node | mapreduce.input.fileinputformat.split.minsize.per.node
*---+---+
|mapred.min.split.size.per.rack | mapreduce.input.fileinputformat.split.minsize.per.rack
*---+---+
|mapred.output.compression.codec | mapreduce.output.fileoutputformat.compress.codec
*---+---+
|mapred.output.compression.type | mapreduce.output.fileoutputformat.compress.type
*---+---+
|mapred.output.compress | mapreduce.output.fileoutputformat.compress
*---+---+
|mapred.output.dir | mapreduce.output.fileoutputformat.outputdir
*---+---+
|mapred.output.key.class | mapreduce.job.output.key.class
*---+---+
|mapred.output.key.comparator.class | mapreduce.job.output.key.comparator.class
*---+---+
|mapred.output.value.class | mapreduce.job.output.value.class
*---+---+
|mapred.output.value.groupfn.class | mapreduce.job.output.group.comparator.class
*---+---+
|mapred.permissions.supergroup | mapreduce.cluster.permissions.supergroup
*---+---+
|mapred.pipes.user.inputformat | mapreduce.pipes.inputformat
*---+---+
|mapred.reduce.child.env | mapreduce.reduce.env
*---+---+
|mapred.reduce.child.java.opts | mapreduce.reduce.java.opts
*---+---+
|mapred.reduce.child.log.level | mapreduce.reduce.log.level
*---+---+
|mapred.reduce.max.attempts | mapreduce.reduce.maxattempts
*---+---+
|mapred.reduce.parallel.copies | mapreduce.reduce.shuffle.parallelcopies
*---+---+
|mapred.reduce.slowstart.completed.maps | mapreduce.job.reduce.slowstart.completedmaps
*---+---+
|mapred.reduce.task.debug.script | mapreduce.reduce.debug.script
*---+---+
|mapred.reduce.tasks | mapreduce.job.reduces
*---+---+
|mapred.reduce.tasks.speculative.execution | mapreduce.reduce.speculative
*---+---+
|mapred.seqbinary.output.key.class | mapreduce.output.seqbinaryoutputformat.key.class
*---+---+
|mapred.seqbinary.output.value.class | mapreduce.output.seqbinaryoutputformat.value.class
*---+---+
|mapred.shuffle.connect.timeout | mapreduce.reduce.shuffle.connect.timeout
*---+---+
|mapred.shuffle.read.timeout | mapreduce.reduce.shuffle.read.timeout
*---+---+
|mapred.skip.attempts.to.start.skipping | mapreduce.task.skip.start.attempts
*---+---+
|mapred.skip.map.auto.incr.proc.count | mapreduce.map.skip.proc-count.auto-incr
*---+---+
|mapred.skip.map.max.skip.records | mapreduce.map.skip.maxrecords
*---+---+
|mapred.skip.on | mapreduce.job.skiprecords
*---+---+
|mapred.skip.out.dir | mapreduce.job.skip.outdir
*---+---+
|mapred.skip.reduce.auto.incr.proc.count | mapreduce.reduce.skip.proc-count.auto-incr
*---+---+
|mapred.skip.reduce.max.skip.groups | mapreduce.reduce.skip.maxgroups
*---+---+
|mapred.speculative.execution.slowNodeThreshold | mapreduce.job.speculative.slownodethreshold
*---+---+
|mapred.speculative.execution.slowTaskThreshold | mapreduce.job.speculative.slowtaskthreshold
*---+---+
|mapred.speculative.execution.speculativeCap | mapreduce.job.speculative.speculativecap
*---+---+
|mapred.submit.replication | mapreduce.client.submit.file.replication
*---+---+
|mapred.system.dir | mapreduce.jobtracker.system.dir
*---+---+
|mapred.task.cache.levels | mapreduce.jobtracker.taskcache.levels
*---+---+
|mapred.task.id | mapreduce.task.attempt.id
*---+---+
|mapred.task.is.map | mapreduce.task.ismap
*---+---+
|mapred.task.partition | mapreduce.task.partition
*---+---+
|mapred.task.profile | mapreduce.task.profile
*---+---+
|mapred.task.profile.maps | mapreduce.task.profile.maps
*---+---+
|mapred.task.profile.params | mapreduce.task.profile.params
*---+---+
|mapred.task.profile.reduces | mapreduce.task.profile.reduces
*---+---+
|mapred.task.timeout | mapreduce.task.timeout
*---+---+
|mapred.tasktracker.dns.interface | mapreduce.tasktracker.dns.interface
*---+---+
|mapred.tasktracker.dns.nameserver | mapreduce.tasktracker.dns.nameserver
*---+---+
|mapred.tasktracker.events.batchsize | mapreduce.tasktracker.events.batchsize
*---+---+
|mapred.tasktracker.expiry.interval | mapreduce.jobtracker.expire.trackers.interval
*---+---+
|mapred.task.tracker.http.address | mapreduce.tasktracker.http.address
*---+---+
|mapred.tasktracker.indexcache.mb | mapreduce.tasktracker.indexcache.mb
*---+---+
|mapred.tasktracker.instrumentation | mapreduce.tasktracker.instrumentation
*---+---+
|mapred.tasktracker.map.tasks.maximum | mapreduce.tasktracker.map.tasks.maximum
*---+---+
|mapred.tasktracker.memory_calculator_plugin | mapreduce.tasktracker.resourcecalculatorplugin
*---+---+
|mapred.tasktracker.memorycalculatorplugin | mapreduce.tasktracker.resourcecalculatorplugin
*---+---+
|mapred.tasktracker.reduce.tasks.maximum | mapreduce.tasktracker.reduce.tasks.maximum
*---+---+
|mapred.task.tracker.report.address | mapreduce.tasktracker.report.address
*---+---+
|mapred.task.tracker.task-controller | mapreduce.tasktracker.taskcontroller
*---+---+
|mapred.tasktracker.taskmemorymanager.monitoring-interval | mapreduce.tasktracker.taskmemorymanager.monitoringinterval
*---+---+
|mapred.tasktracker.tasks.sleeptime-before-sigkill | mapreduce.tasktracker.tasks.sleeptimebeforesigkill
*---+---+
|mapred.temp.dir | mapreduce.cluster.temp.dir
*---+---+
|mapred.text.key.comparator.options | mapreduce.partition.keycomparator.options
*---+---+
|mapred.text.key.partitioner.options | mapreduce.partition.keypartitioner.options
*---+---+
|mapred.textoutputformat.separator | mapreduce.output.textoutputformat.separator
*---+---+
|mapred.tip.id | mapreduce.task.id
*---+---+
|mapreduce.combine.class | mapreduce.job.combine.class
*---+---+
|mapreduce.inputformat.class | mapreduce.job.inputformat.class
*---+---+
|mapreduce.job.counters.limit | mapreduce.job.counters.max
*---+---+
|mapreduce.jobtracker.permissions.supergroup | mapreduce.cluster.permissions.supergroup
*---+---+
|mapreduce.map.class | mapreduce.job.map.class
*---+---+
|mapreduce.outputformat.class | mapreduce.job.outputformat.class
*---+---+
|mapreduce.partitioner.class | mapreduce.job.partitioner.class
*---+---+
|mapreduce.reduce.class | mapreduce.job.reduce.class
*---+---+
|mapred.used.genericoptionsparser | mapreduce.client.genericoptionsparser.used
*---+---+
|mapred.userlog.limit.kb | mapreduce.task.userlog.limit.kb
*---+---+
|mapred.userlog.retain.hours | mapreduce.job.userlog.retain.hours
*---+---+
|mapred.working.dir | mapreduce.job.working.dir
*---+---+
|mapred.work.output.dir | mapreduce.task.output.dir
*---+---+
|min.num.spills.for.combine | mapreduce.map.combine.minspills
*---+---+
|reduce.output.key.value.fields.spec | mapreduce.fieldsel.reduce.output.key.value.fields.spec
*---+---+
|security.job.submission.protocol.acl | security.job.client.protocol.acl
*---+---+
|security.task.umbilical.protocol.acl | security.job.task.protocol.acl
*---+---+
|sequencefile.filter.class | mapreduce.input.sequencefileinputfilter.class
*---+---+
|sequencefile.filter.frequency | mapreduce.input.sequencefileinputfilter.frequency
*---+---+
|sequencefile.filter.regex | mapreduce.input.sequencefileinputfilter.regex
*---+---+
|session.id | dfs.metrics.session-id
*---+---+
|slave.host.name | dfs.datanode.hostname
*---+---+
|slave.host.name | mapreduce.tasktracker.host.name
*---+---+
|tasktracker.contention.tracking | mapreduce.tasktracker.contention.tracking
*---+---+
|tasktracker.http.threads | mapreduce.tasktracker.http.threads
*---+---+
|topology.node.switch.mapping.impl | net.topology.node.switch.mapping.impl
*---+---+
|topology.script.file.name | net.topology.script.file.name
*---+---+
|topology.script.number.args | net.topology.script.number.args
*---+---+
|user.name | mapreduce.job.user.name
*---+---+
|webinterface.private.actions | mapreduce.jobtracker.webinterface.trusted
*---+---+
|yarn.app.mapreduce.yarn.app.mapreduce.client-am.ipc.max-retries-on-timeouts | yarn.app.mapreduce.client-am.ipc.max-retries-on-timeouts
*---+---+
The following table lists additional changes to some configuration properties:
*-------------------------------+-----------------------+
|| <<Deprecated property name>> || <<New property name>>|
*-------------------------------+-----------------------+
|mapred.create.symlink | NONE - symlinking is always on
*---+---+
|mapreduce.job.cache.symlink.create | NONE - symlinking is always on
*---+---+
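When setting properties on the command line via the generic <<<-D>>> option,
the new names above should be used. For example (the file paths and block
size value below are only illustrative):
+---+
$ hadoop fs -D dfs.blocksize=134217728 -put localfile /user/hadoop/
+---+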


@@ -1,691 +0,0 @@
~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements. See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License. You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.
---
File System Shell Guide
---
---
${maven.build.timestamp}
%{toc}
Overview
The File System (FS) shell includes various shell-like commands that
directly interact with the Hadoop Distributed File System (HDFS) as well as
other file systems that Hadoop supports, such as Local FS, HFTP FS, S3 FS,
and others. The FS shell is invoked by:
+---
bin/hadoop fs <args>
+---
All FS shell commands take path URIs as arguments. The URI format is
<<<scheme://authority/path>>>. For HDFS the scheme is <<<hdfs>>>, and for
the Local FS the scheme is <<<file>>>. The scheme and authority are
optional. If not specified, the default scheme specified in the
configuration is used. An HDFS file or directory such as /parent/child can
be specified as <<<hdfs://namenodehost/parent/child>>> or simply as
<<</parent/child>>> (given that your configuration is set to point to
<<<hdfs://namenodehost>>>).
Most of the commands in FS shell behave like corresponding Unix commands.
Differences are described with each of the commands. Error information is
sent to stderr and the output is sent to stdout.
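For example, given a configuration whose default file system points to
<<<hdfs://namenodehost>>>, the following two invocations are equivalent:
+---+
bin/hadoop fs -ls hdfs://namenodehost/parent/child
bin/hadoop fs -ls /parent/child
+---+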
appendToFile
Usage: <<<hdfs dfs -appendToFile <localsrc> ... <dst> >>>
Append single src, or multiple srcs from local file system to the
destination file system. Also reads input from stdin and appends to
destination file system.
* <<<hdfs dfs -appendToFile localfile /user/hadoop/hadoopfile>>>
* <<<hdfs dfs -appendToFile localfile1 localfile2 /user/hadoop/hadoopfile>>>
* <<<hdfs dfs -appendToFile localfile hdfs://nn.example.com/hadoop/hadoopfile>>>
* <<<hdfs dfs -appendToFile - hdfs://nn.example.com/hadoop/hadoopfile>>>
Reads the input from stdin.
Exit Code:
Returns 0 on success and 1 on error.
cat
Usage: <<<hdfs dfs -cat URI [URI ...]>>>
Copies source paths to stdout.
Example:
* <<<hdfs dfs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2>>>
* <<<hdfs dfs -cat file:///file3 /user/hadoop/file4>>>
Exit Code:
Returns 0 on success and -1 on error.
chgrp
Usage: <<<hdfs dfs -chgrp [-R] GROUP URI [URI ...]>>>
Change group association of files. The user must be the owner of files, or
else a super-user. Additional information is in the
{{{../hadoop-hdfs/HdfsPermissionsGuide.html}Permissions Guide}}.
Options
* The -R option will make the change recursively through the directory structure.
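Example (the group name and path below are only illustrative):
* <<<hdfs dfs -chgrp -R hadoop /user/hadoop/dir1>>>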
chmod
Usage: <<<hdfs dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]>>>
Change the permissions of files. With -R, make the change recursively
through the directory structure. The user must be the owner of the file, or
else a super-user. Additional information is in the
{{{../hadoop-hdfs/HdfsPermissionsGuide.html}Permissions Guide}}.
Options
* The -R option will make the change recursively through the directory structure.
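Example (the mode and paths below are only illustrative):
* <<<hdfs dfs -chmod 644 /user/hadoop/file1>>>
* <<<hdfs dfs -chmod -R 755 /user/hadoop/dir1>>>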
chown
Usage: <<<hdfs dfs -chown [-R] [OWNER][:[GROUP]] URI [URI ]>>>
Change the owner of files. The user must be a super-user. Additional information
is in the {{{../hadoop-hdfs/HdfsPermissionsGuide.html}Permissions Guide}}.
Options
* The -R option will make the change recursively through the directory structure.
copyFromLocal
Usage: <<<hdfs dfs -copyFromLocal <localsrc> URI>>>
Similar to put command, except that the source is restricted to a local
file reference.
Options:
* The -f option will overwrite the destination if it already exists.
copyToLocal
Usage: <<<hdfs dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst> >>>
Similar to get command, except that the destination is restricted to a
local file reference.
count
Usage: <<<hdfs dfs -count [-q] [-h] <paths> >>>
Count the number of directories, files and bytes under the paths that match
the specified file pattern. The output columns with -count are: DIR_COUNT,
FILE_COUNT, CONTENT_SIZE, FILE_NAME
The output columns with -count -q are: QUOTA, REMAINING_QUOTA, SPACE_QUOTA,
REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, FILE_NAME
The -h option shows sizes in human readable format.
Example:
* <<<hdfs dfs -count hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2>>>
* <<<hdfs dfs -count -q hdfs://nn1.example.com/file1>>>
* <<<hdfs dfs -count -q -h hdfs://nn1.example.com/file1>>>
Exit Code:
Returns 0 on success and -1 on error.
cp
Usage: <<<hdfs dfs -cp [-f] [-p | -p[topax]] URI [URI ...] <dest> >>>
Copy files from source to destination. This command allows multiple sources
as well in which case the destination must be a directory.
'raw.*' namespace extended attributes are preserved if (1) the source and
destination filesystems support them (HDFS only), and (2) all source and
destination pathnames are in the /.reserved/raw hierarchy. Determination of
whether raw.* namespace xattrs are preserved is independent of the
-p (preserve) flag.
Options:
* The -f option will overwrite the destination if it already exists.
* The -p option will preserve file attributes [topx] (timestamps,
ownership, permission, ACL, XAttr). If -p is specified with no <arg>,
then preserves timestamps, ownership, permission. If -pa is specified,
then preserves permission also because ACL is a super-set of
permission. Determination of whether raw namespace extended attributes
are preserved is independent of the -p flag.
Example:
* <<<hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2>>>
* <<<hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir>>>
Exit Code:
Returns 0 on success and -1 on error.
du
Usage: <<<hdfs dfs -du [-s] [-h] URI [URI ...]>>>
Displays sizes of files and directories contained in the given directory or
the length of a file in case it's just a file.
Options:
* The -s option will result in an aggregate summary of file lengths being
displayed, rather than the individual files.
* The -h option will format file sizes in a "human-readable" fashion (e.g.
64.0m instead of 67108864)
Example:
* hdfs dfs -du /user/hadoop/dir1 /user/hadoop/file1 hdfs://nn.example.com/user/hadoop/dir1
Exit Code:
Returns 0 on success and -1 on error.
dus
Usage: <<<hdfs dfs -dus <args> >>>
Displays a summary of file lengths.
<<Note:>> This command is deprecated. Instead use <<<hdfs dfs -du -s>>>.
expunge
Usage: <<<hdfs dfs -expunge>>>
Empty the Trash. Refer to the {{{../hadoop-hdfs/HdfsDesign.html}
HDFS Architecture Guide}} for more information on the Trash feature.
find
Usage: <<<hdfs dfs -find <path> ... <expression> ... >>>
Finds all files that match the specified expression and applies selected
actions to them. If no <path> is specified then defaults to the current
working directory. If no expression is specified then defaults to -print.
The following primary expressions are recognised:
* -name pattern \
-iname pattern
Evaluates as true if the basename of the file matches the pattern using
standard file system globbing. If -iname is used then the match is case
insensitive.
* -print \
-print0
Always evaluates to true. Causes the current pathname to be written to
standard output. If the -print0 expression is used then an ASCII NULL
character is appended.
The following operators are recognised:
* expression -a expression \
expression -and expression \
expression expression
Logical AND operator for joining two expressions. Returns true if both
child expressions return true. Implied by the juxtaposition of two
expressions and so does not need to be explicitly specified. The second
expression will not be applied if the first fails.
Example:
<<<hdfs dfs -find / -name test -print>>>
Exit Code:
Returns 0 on success and -1 on error.
get
Usage: <<<hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst> >>>
Copy files to the local file system. Files that fail the CRC check may be
copied with the -ignorecrc option. Files and CRCs may be copied using the
-crc option.
Example:
* <<<hdfs dfs -get /user/hadoop/file localfile>>>
* <<<hdfs dfs -get hdfs://nn.example.com/user/hadoop/file localfile>>>
Exit Code:
Returns 0 on success and -1 on error.
getfacl
Usage: <<<hdfs dfs -getfacl [-R] <path> >>>
Displays the Access Control Lists (ACLs) of files and directories. If a
directory has a default ACL, then getfacl also displays the default ACL.
Options:
* -R: List the ACLs of all files and directories recursively.
* <path>: File or directory to list.
Examples:
* <<<hdfs dfs -getfacl /file>>>
* <<<hdfs dfs -getfacl -R /dir>>>
Exit Code:
Returns 0 on success and non-zero on error.
getfattr
Usage: <<<hdfs dfs -getfattr [-R] {-n name | -d} [-e en] <path> >>>
Displays the extended attribute names and values (if any) for a file or
directory.
Options:
* -R: Recursively list the attributes for all files and directories.
* -n name: Dump the named extended attribute value.
* -d: Dump all extended attribute values associated with pathname.
* -e <encoding>: Encode values after retrieving them. Valid encodings are "text", "hex", and "base64". Values encoded as text strings are enclosed in double quotes ("), and values encoded as hexadecimal and base64 are prefixed with 0x and 0s, respectively.
* <path>: The file or directory.
Examples:
* <<<hdfs dfs -getfattr -d /file>>>
* <<<hdfs dfs -getfattr -R -n user.myAttr /dir>>>
Exit Code:
Returns 0 on success and non-zero on error.
getmerge
Usage: <<<hdfs dfs -getmerge <src> <localdst> [addnl]>>>
Takes a source directory and a destination file as input and concatenates
files in src into the destination local file. Optionally addnl can be set to
enable adding a newline character at the
end of each file.
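Example (the source directory and local destination below are only illustrative):
* <<<hdfs dfs -getmerge /user/hadoop/dir1 ./merged.txt>>>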
ls
Usage: <<<hdfs dfs -ls [-R] <args> >>>
Options:
* The -R option will return stat recursively through the directory
structure.
For a file returns stat on the file with the following format:
+---+
permissions number_of_replicas userid groupid filesize modification_date modification_time filename
+---+
For a directory it returns the list of its direct children as in Unix. A directory is listed as:
+---+
permissions userid groupid modification_date modification_time dirname
+---+
Example:
* <<<hdfs dfs -ls /user/hadoop/file1>>>
Exit Code:
Returns 0 on success and -1 on error.
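A sketch of the <<<-R>>> option, using an illustrative directory:
+---+
# list /user/hadoop and everything below it, one stat line per file or directory
$ hdfs dfs -ls -R /user/hadoop
+---+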
lsr
Usage: <<<hdfs dfs -lsr <args> >>>
Recursive version of ls.
<<Note:>> This command is deprecated. Instead use <<<hdfs dfs -ls -R>>>
mkdir
Usage: <<<hdfs dfs -mkdir [-p] <paths> >>>
Takes path URIs as arguments and creates directories.
Options:
* The -p option behavior is much like Unix mkdir -p, creating parent directories along the path.
Example:
* <<<hdfs dfs -mkdir /user/hadoop/dir1 /user/hadoop/dir2>>>
* <<<hdfs dfs -mkdir hdfs://nn1.example.com/user/hadoop/dir hdfs://nn2.example.com/user/hadoop/dir>>>
Exit Code:
Returns 0 on success and -1 on error.
moveFromLocal
Usage: <<<hdfs dfs -moveFromLocal <localsrc> <dst> >>>
Similar to the put command, except that the source localsrc is deleted after
it is copied.
moveToLocal
Usage: <<<hdfs dfs -moveToLocal [-crc] <src> <dst> >>>
Displays a "Not implemented yet" message.
mv
Usage: <<<hdfs dfs -mv URI [URI ...] <dest> >>>
Moves files from source to destination. This command also allows multiple
sources, in which case the destination needs to be a directory. Moving files
across file systems is not permitted.
Example:
* <<<hdfs dfs -mv /user/hadoop/file1 /user/hadoop/file2>>>
* <<<hdfs dfs -mv hdfs://nn.example.com/file1 hdfs://nn.example.com/file2 hdfs://nn.example.com/file3 hdfs://nn.example.com/dir1>>>
Exit Code:
Returns 0 on success and -1 on error.
put
Usage: <<<hdfs dfs -put <localsrc> ... <dst> >>>
Copy a single src, or multiple srcs, from the local file system to the
destination file system. Also reads input from stdin and writes to the
destination file system.
* <<<hdfs dfs -put localfile /user/hadoop/hadoopfile>>>
* <<<hdfs dfs -put localfile1 localfile2 /user/hadoop/hadoopdir>>>
* <<<hdfs dfs -put localfile hdfs://nn.example.com/hadoop/hadoopfile>>>
* <<<hdfs dfs -put - hdfs://nn.example.com/hadoop/hadoopfile>>>
Reads the input from stdin.
Exit Code:
Returns 0 on success and -1 on error.
rm
Usage: <<<hdfs dfs -rm [-f] [-r|-R] [-skipTrash] URI [URI ...]>>>
Delete files specified as args.
Options:
* The -f option will not display a diagnostic message or modify the exit
status to reflect an error if the file does not exist.
* The -R option deletes the directory and any content under it recursively.
* The -r option is equivalent to -R.
* The -skipTrash option will bypass trash, if enabled, and delete the
specified file(s) immediately. This can be useful when it is necessary
to delete files from an over-quota directory.
Example:
* <<<hdfs dfs -rm hdfs://nn.example.com/file /user/hadoop/emptydir>>>
Exit Code:
Returns 0 on success and -1 on error.
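A sketch of the recursive and trash-bypassing options described above; the
paths are illustrative only:
+---+
# delete a directory tree, bypassing the trash even if trash is enabled
$ hdfs dfs -rm -r -skipTrash /user/hadoop/tmpdir

# exit with status 0 and no diagnostic even if the file does not exist
$ hdfs dfs -rm -f /user/hadoop/maybe-missing-file
+---+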
rmr
Usage: <<<hdfs dfs -rmr [-skipTrash] URI [URI ...]>>>
Recursive version of delete.
<<Note:>> This command is deprecated. Instead use <<<hdfs dfs -rm -r>>>
setfacl
Usage: <<<hdfs dfs -setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>] >>>
Sets Access Control Lists (ACLs) of files and directories.
Options:
* -b: Remove all but the base ACL entries. The entries for user, group and
others are retained for compatibility with permission bits.
* -k: Remove the default ACL.
* -R: Apply operations to all files and directories recursively.
* -m: Modify ACL. New entries are added to the ACL, and existing entries
are retained.
* -x: Remove specified ACL entries. Other ACL entries are retained.
* --set: Fully replace the ACL, discarding all existing entries. The
<acl_spec> must include entries for user, group, and others for
compatibility with permission bits.
* <acl_spec>: Comma separated list of ACL entries.
* <path>: File or directory to modify.
Examples:
* <<<hdfs dfs -setfacl -m user:hadoop:rw- /file>>>
* <<<hdfs dfs -setfacl -x user:hadoop /file>>>
* <<<hdfs dfs -setfacl -b /file>>>
* <<<hdfs dfs -setfacl -k /dir>>>
* <<<hdfs dfs -setfacl --set user::rw-,user:hadoop:rw-,group::r--,other::r-- /file>>>
* <<<hdfs dfs -setfacl -R -m user:hadoop:r-x /dir>>>
* <<<hdfs dfs -setfacl -m default:user:hadoop:r-x /dir>>>
Exit Code:
Returns 0 on success and non-zero on error.
setfattr
Usage: <<<hdfs dfs -setfattr {-n name [-v value] | -x name} <path> >>>
Sets an extended attribute name and value for a file or directory.
Options:
* -b: Remove all but the base ACL entries. The entries for user, group and others are retained for compatibility with permission bits.
* -n name: The extended attribute name.
* -v value: The extended attribute value. There are three different encoding methods for the value. If the argument is enclosed in double quotes, then the value is the string inside the quotes. If the argument is prefixed with 0x or 0X, then it is taken as a hexadecimal number. If the argument begins with 0s or 0S, then it is taken as a base64 encoding.
* -x name: Remove the extended attribute.
* <path>: The file or directory.
Examples:
* <<<hdfs dfs -setfattr -n user.myAttr -v myValue /file>>>
* <<<hdfs dfs -setfattr -n user.noValue /file>>>
* <<<hdfs dfs -setfattr -x user.myAttr /file>>>
Exit Code:
Returns 0 on success and non-zero on error.
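A sketch of the three value encodings described above; the attribute name and
path are carried over from the examples, and the hexadecimal and base64
strings simply encode the text "myValue":
+---+
# string value (the quotes delimit the value)
$ hdfs dfs -setfattr -n user.myAttr -v "myValue" /file

# the same value given as hexadecimal (0x prefix)
$ hdfs dfs -setfattr -n user.myAttr -v 0x6d7956616c7565 /file

# the same value given as base64 (0s prefix)
$ hdfs dfs -setfattr -n user.myAttr -v 0sbXlWYWx1ZQ== /file
+---+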
setrep
Usage: <<<hdfs dfs -setrep [-R] [-w] <numReplicas> <path> >>>
Changes the replication factor of a file. If <path> is a directory then
the command recursively changes the replication factor of all files under
the directory tree rooted at <path>.
Options:
* The -w flag requests that the command wait for the replication
to complete. This can potentially take a very long time.
* The -R flag is accepted for backwards compatibility. It has no effect.
Example:
* <<<hdfs dfs -setrep -w 3 /user/hadoop/dir1>>>
Exit Code:
Returns 0 on success and -1 on error.
stat
Usage: <<<hdfs dfs -stat [format] \<path\> ...>>>
Print statistics about the file/directory at \<path\> in the specified
format. Format accepts filesize in blocks (%b), type (%F), group name of
owner (%g), name (%n), block size (%o), replication (%r), user name of
owner(%u), and modification date (%y, %Y). %y shows UTC date as
"yyyy-MM-dd HH:mm:ss" and %Y shows milliseconds since January 1, 1970 UTC.
If the format is not specified, %y is used by default.
Example:
* <<<hdfs dfs -stat "%F %u:%g %b %y %n" /file>>>
Exit Code:
Returns 0 on success and -1 on error.
tail
Usage: <<<hdfs dfs -tail [-f] URI>>>
Displays last kilobyte of the file to stdout.
Options:
* The -f option will output appended data as the file grows, as in Unix.
Example:
* <<<hdfs dfs -tail pathname>>>
Exit Code:
Returns 0 on success and -1 on error.
test
Usage: <<<hdfs dfs -test -[ezd] URI>>>
Options:
* The -e option will check to see if the file exists, returning 0 if true.
* The -z option will check to see if the file is zero length, returning 0 if true.
* The -d option will check to see if the path is a directory, returning 0 if true.
Example:
* <<<hdfs dfs -test -e filename>>>
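Because -test communicates its result through the exit code, a typical shell
usage sketch (paths illustrative):
+---+
# branch on whether a path exists
$ if hdfs dfs -test -e /user/hadoop/file1; then echo "exists"; fi

# -d returns 0 only for directories; print the exit status
$ hdfs dfs -test -d /user/hadoop/dir1; echo $?
+---+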
text
Usage: <<<hdfs dfs -text <src> >>>
Takes a source file and outputs the file in text format. The allowed formats
are zip and TextRecordInputStream.
touchz
Usage: <<<hdfs dfs -touchz URI [URI ...]>>>
Create a file of zero length.
Example:
* <<<hdfs dfs -touchz pathname>>>
Exit Code:
Returns 0 on success and -1 on error.
truncate
Usage: <<<hadoop fs -truncate [-w] <length> <paths> >>>
Truncate all files that match the specified file pattern to the
specified length.
Options:
* The -w flag requests that the command wait for block recovery
to complete, if necessary. Without the -w flag the file may remain
unclosed for some time while the recovery is in progress.
During this time the file cannot be reopened for append.
Example:
* <<<hadoop fs -truncate 55 /user/hadoop/file1 /user/hadoop/file2>>>
* <<<hadoop fs -truncate -w 127 hdfs://nn1.example.com/user/hadoop/file1>>>
View File
@ -1,98 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Authentication for Hadoop HTTP web-consoles
---
---
${maven.build.timestamp}
Authentication for Hadoop HTTP web-consoles
%{toc|section=1|fromDepth=0}
* Introduction
This document describes how to configure Hadoop HTTP web-consoles to
require user authentication.
By default Hadoop HTTP web-consoles (JobTracker, NameNode, TaskTrackers
and DataNodes) allow access without any form of authentication.
Similarly to Hadoop RPC, Hadoop HTTP web-consoles can be configured to
require Kerberos authentication using HTTP SPNEGO protocol (supported
by browsers like Firefox and Internet Explorer).
In addition, Hadoop HTTP web-consoles support the equivalent of
Hadoop's Pseudo/Simple authentication. If this option is enabled, users
must specify their user name in the first browser interaction using the
user.name query string parameter. For example:
<<<http://localhost:50030/jobtracker.jsp?user.name=babu>>>.
If a custom authentication mechanism is required for the HTTP
web-consoles, it is possible to implement a plugin to support the
alternate authentication mechanism (refer to Hadoop hadoop-auth for details
on writing an <<<AuthenticatorHandler>>>).
The next section describes how to configure Hadoop HTTP web-consoles to
require user authentication.
* Configuration
The following properties should be in the <<<core-site.xml>>> of all the
nodes in the cluster.
<<<hadoop.http.filter.initializers>>>: add to this property the
<<<org.apache.hadoop.security.AuthenticationFilterInitializer>>> initializer
class.
<<<hadoop.http.authentication.type>>>: Defines authentication used for the
HTTP web-consoles. The supported values are: <<<simple>>> | <<<kerberos>>> |
<<<#AUTHENTICATION_HANDLER_CLASSNAME#>>>. The default value is <<<simple>>>.
<<<hadoop.http.authentication.token.validity>>>: Indicates how long (in
seconds) an authentication token is valid before it has to be renewed.
The default value is <<<36000>>>.
<<<hadoop.http.authentication.signature.secret.file>>>: The signature secret
file for signing the authentication tokens. The same secret should be used
for all nodes in the cluster, JobTracker, NameNode, DataNode and TaskTracker.
The default value is <<<${user.home}/hadoop-http-auth-signature-secret>>>.
IMPORTANT: This file should be readable only by the Unix user running the
daemons.
<<<hadoop.http.authentication.cookie.domain>>>: The domain to use for the
HTTP cookie that stores the authentication token. In order for
authentication to work correctly across all nodes in the cluster the
domain must be correctly set. There is no default value; in that case the
HTTP cookie will not have a domain and will work only with the hostname
issuing the HTTP cookie.
IMPORTANT: when using IP addresses, browsers ignore cookies with domain
settings. For this setting to work properly all nodes in the cluster
must be configured to generate URLs with <<<hostname.domain>>> names in them.
<<<hadoop.http.authentication.simple.anonymous.allowed>>>: Indicates if
anonymous requests are allowed when using 'simple' authentication. The
default value is <<<true>>>.
<<<hadoop.http.authentication.kerberos.principal>>>: Indicates the Kerberos
principal to be used for HTTP endpoint when using 'kerberos'
authentication. The principal short name must be <<<HTTP>>> per Kerberos HTTP
SPNEGO specification. The default value is <<<HTTP/_HOST@$LOCALHOST>>>,
where <<<_HOST>>>, if present, is replaced with the bind address of the HTTP
server.
<<<hadoop.http.authentication.kerberos.keytab>>>: Location of the keytab file
with the credentials for the Kerberos principal used for the HTTP
endpoint. The default value is <<<${user.home}/hadoop.keytab>>>.
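As a minimal sketch of preparing the signature secret file under its default
location (how the secret is generated and copied is illustrative; any method
that produces a random secret readable only by the daemon user will do):
+---+
# create a random signature secret, readable only by the Unix user running the daemons
$ dd if=/dev/urandom bs=64 count=1 2>/dev/null | base64 > ${HOME}/hadoop-http-auth-signature-secret
$ chmod 600 ${HOME}/hadoop-http-auth-signature-secret

# the same secret file must be present on every node in the cluster
$ scp ${HOME}/hadoop-http-auth-signature-secret otherhost:hadoop-http-auth-signature-secret
+---+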
View File
@ -1,239 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop Interface Taxonomy: Audience and Stability Classification
---
---
${maven.build.timestamp}
Hadoop Interface Taxonomy: Audience and Stability Classification
%{toc|section=1|fromDepth=0}
* Motivation
The interface taxonomy classification provided here is for guidance to
developers and users of interfaces. The classification guides a developer
to declare the targeted audience or users of an interface and also its
stability.
* Benefits to the user of an interface: Knows which interfaces to use or not
use and their stability.
* Benefits to the developer: to prevent accidental changes of interfaces and
hence accidental impact on users or other components or system. This is
particularly useful in large systems with many developers who may not all
have a shared state/history of the project.
* Interface Classification
Hadoop adopts the following interface classification.
This classification was derived from the
{{{http://www.opensolaris.org/os/community/arc/policies/interface-taxonomy/#Advice}OpenSolaris taxonomy}}
and, to some extent, from taxonomy used inside Yahoo. Interfaces have two main
attributes: Audience and Stability
** Audience
Audience denotes the potential consumers of the interface. While many
interfaces are internal/private to the implementation,
others are public/external interfaces meant for wider consumption by
applications and/or clients. For example, in POSIX, libc is an external or
public interface, while large parts of the kernel are internal or private
interfaces. Also, some interfaces are targeted towards other specific
subsystems.
Identifying the audience of an interface helps define the impact of
breaking it. For instance, it might be okay to break the compatibility of
an interface whose audience is a small number of specific subsystems. On
the other hand, it is probably not okay to break a protocol interface
that millions of Internet users depend on.
Hadoop uses the following kinds of audience in order of
increasing/wider visibility:
* Private:
* The interface is for internal use within the project (such as HDFS or
MapReduce) and should not be used by applications or by other projects. It
is subject to change at anytime without notice. Most interfaces of a
project are Private (also referred to as project-private).
* Limited-Private:
* The interface is used by a specified set of projects or systems
(typically closely related projects). Other projects or systems should not
use the interface. Changes to the interface will be communicated/
negotiated with the specified projects. For example, in the Hadoop project,
some interfaces are LimitedPrivate\{HDFS, MapReduce\} in that they
are private to the HDFS and MapReduce projects.
* Public
* The interface is for general use by any application.
Hadoop doesn't have a Company-Private classification,
which is meant for APIs which are intended to be used by other projects
within the company, since it doesn't apply to open source projects. Also,
certain APIs are annotated as @VisibleForTesting (from
com.google.common.annotations.VisibleForTesting); these are meant to be used strictly for
unit tests and should be treated as "Private" APIs.
** Stability
Stability denotes how stable an interface is, as in when incompatible
changes to the interface are allowed. Hadoop APIs have the following
levels of stability.
* Stable
* Can evolve while retaining compatibility for minor release boundaries;
in other words, incompatible changes to APIs marked Stable are allowed
only at major releases (i.e. at m.0).
* Evolving
* Evolving, but incompatible changes are allowed at minor releases (i.e. m.x)
* Unstable
* Incompatible changes to Unstable APIs are allowed any time. This
usually makes sense for only private interfaces.
* However one may call this out for a supposedly public interface to
highlight that it should not be used as an interface; for public
interfaces, labeling it as Not-an-interface is probably more appropriate
than "Unstable".
* Examples of publicly visible interfaces that are unstable (i.e.
not-an-interface): GUI, CLIs whose output format will change
* Deprecated
* APIs that could potentially be removed in the future and should not be
used.
* How are the Classifications Recorded?
How will the classification be recorded for Hadoop APIs?
* Each interface or class will have the audience and stability recorded
using annotations in org.apache.hadoop.classification package.
* The javadoc generated by the maven target javadoc:javadoc lists only the
public API.
* One can derive the audience of java classes and java interfaces by the
audience of the package in which they are contained. Hence it is useful to
declare the audience of each java package as public or private (along with
the private audience variations).
* FAQ
* Why aren't the Java scopes (private, package private and public) good
enough?
* Java's scoping is not very complete. One is often forced to make a class
public in order for other internal components to use it. It does not have
friend declarations or sub-package-private visibility like C++.
* But I can easily access a private implementation interface if it is Java
public. Where is the protection and control?
* The purpose of this is not providing absolute access control. Its purpose
is to communicate to users and developers. One can access private
implementation functions in libc; however if they change the internal
implementation details, your application will break and you will have little
sympathy from the folks who are supplying libc. If you use a non-public
interface you understand the risks.
* Why bother declaring the stability of a private interface? Aren't private
interfaces always unstable?
* Private interfaces are not always unstable. In the cases where they are
stable they capture internal properties of the system and can communicate
these properties to its internal users and to developers of the interface.
* e.g. In HDFS, NN-DN protocol is private but stable and can help
implement rolling upgrades. It communicates that this interface should not
be changed in incompatible ways even though it is private.
* e.g. In HDFS, FSImage stability can help provide more flexible roll
backs.
* What is the harm in applications using a private interface that is
stable? How is it different than a public stable interface?
* While a private interface marked as stable is targeted to change only at
major releases, it may break at other times if the providers of that
interface are willing to change the internal users of that interface.
Further, a public stable interface is less likely to break even at major
releases (even though it is allowed to break compatibility) because the
impact of the change is larger. If you use a private interface (regardless
of its stability) you run the risk of incompatibility.
* Why bother with Limited-private? Isn't it giving special treatment to some
projects? That is not fair.
* First, most interfaces should be public or private; actually let us state
it even stronger: make it private unless you really want to expose it to
public for general use.
* Limited-private is for interfaces that are not intended for general use.
They are exposed to related projects that need special hooks. Such a
classification has a cost to both the supplier and consumer of the limited
interface. Both will have to work together if ever there is a need to break
the interface in the future; for example the supplier and the consumers will
have to work together to get coordinated releases of their respective
projects. This should not be taken lightly: if you can get away with
private then do so; if the interface is really for general use by all
applications then make it public. But remember that making an interface
public carries a huge responsibility. Sometimes Limited-private is just right.
* A good example of a limited-private interface is BlockLocations. This is a
fairly low-level interface that we are willing to expose to MR and perhaps
HBase. We are likely to change it down the road and at that time we will
have to coordinate with the MR team to release matching releases.
While MR and HDFS are always released in sync today, they may change down
the road.
* If you have a limited-private interface with many projects listed then
you are fooling yourself. It is practically public.
* It might be worth declaring a special audience classification called
Hadoop-Private for the Hadoop family.
* Let's treat all private interfaces as Hadoop-private. What is the harm in
projects in the Hadoop family having access to private classes?
* Do we want MR accessing class files that are implementation details
inside HDFS? There used to be many such layer violations in the code that
we have been cleaning up over the last few years. We don't want such
layer violations to creep back in by not separating the major
components like HDFS and MR.
* Aren't all public interfaces stable?
* One may mark a public interface as evolving in its early days.
Here one is promising to make an effort to make compatible changes but may
need to break it at minor releases.
* One example of a public interface that is unstable is where one is providing
an implementation of a standards-body based interface that is still under development.
For example, many companies, in an attempt to be first to market,
have provided implementations of a new NFS protocol even when the protocol was not
fully completed by IETF.
The implementor cannot evolve the interface in a fashion that causes the least disruption
because the stability is controlled by the standards body. Hence it is appropriate to
label the interface as unstable.
View File
@ -1,889 +0,0 @@
~~ Licensed to the Apache Software Foundation (ASF) under one or more
~~ contributor license agreements. See the NOTICE file distributed with
~~ this work for additional information regarding copyright ownership.
~~ The ASF licenses this file to You under the Apache License, Version 2.0
~~ (the "License"); you may not use this file except in compliance with
~~ the License. You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License.
---
Metrics Guide
---
---
${maven.build.timestamp}
%{toc}
Overview
Metrics are statistical information exposed by Hadoop daemons,
used for monitoring, performance tuning and debugging.
There are many metrics available by default
and they are very useful for troubleshooting.
This page shows the details of the available metrics.
Each section describes one of the contexts into which metrics are grouped.
The documentation of Metrics 2.0 framework is
{{{../../api/org/apache/hadoop/metrics2/package-summary.html}here}}.
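One convenient way to inspect these metrics ad hoc is the JMX JSON servlet
that Hadoop daemons serve under <<</jmx>>>; the host name below is
illustrative and the default NameNode HTTP port is assumed:
+---+
# dump every metrics record exposed by a NameNode as JSON
$ curl -s 'http://namenode.example.com:50070/jmx'

# restrict the output to a single bean, e.g. the FSNamesystem metrics described below
$ curl -s 'http://namenode.example.com:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'
+---+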
jvm context
* JvmMetrics
Each metrics record contains tags such as ProcessName, SessionID
and Hostname as additional information along with metrics.
*-------------------------------------+--------------------------------------+
|| Name || Description
*-------------------------------------+--------------------------------------+
|<<<MemNonHeapUsedM>>> | Current non-heap memory used in MB
*-------------------------------------+--------------------------------------+
|<<<MemNonHeapCommittedM>>> | Current non-heap memory committed in MB
*-------------------------------------+--------------------------------------+
|<<<MemNonHeapMaxM>>> | Max non-heap memory size in MB
*-------------------------------------+--------------------------------------+
|<<<MemHeapUsedM>>> | Current heap memory used in MB
*-------------------------------------+--------------------------------------+
|<<<MemHeapCommittedM>>> | Current heap memory committed in MB
*-------------------------------------+--------------------------------------+
|<<<MemHeapMaxM>>> | Max heap memory size in MB
*-------------------------------------+--------------------------------------+
|<<<MemMaxM>>> | Max memory size in MB
*-------------------------------------+--------------------------------------+
|<<<ThreadsNew>>> | Current number of NEW threads
*-------------------------------------+--------------------------------------+
|<<<ThreadsRunnable>>> | Current number of RUNNABLE threads
*-------------------------------------+--------------------------------------+
|<<<ThreadsBlocked>>> | Current number of BLOCKED threads
*-------------------------------------+--------------------------------------+
|<<<ThreadsWaiting>>> | Current number of WAITING threads
*-------------------------------------+--------------------------------------+
|<<<ThreadsTimedWaiting>>> | Current number of TIMED_WAITING threads
*-------------------------------------+--------------------------------------+
|<<<ThreadsTerminated>>> | Current number of TERMINATED threads
*-------------------------------------+--------------------------------------+
|<<<GcInfo>>> | Total GC count and GC time in msec, grouped by the kind of GC. \
| ex.) GcCountPS Scavenge=6, GCTimeMillisPS Scavenge=40,
| GCCountPS MarkSweep=0, GCTimeMillisPS MarkSweep=0
*-------------------------------------+--------------------------------------+
|<<<GcCount>>> | Total GC count
*-------------------------------------+--------------------------------------+
|<<<GcTimeMillis>>> | Total GC time in msec
*-------------------------------------+--------------------------------------+
|<<<LogFatal>>> | Total number of FATAL logs
*-------------------------------------+--------------------------------------+
|<<<LogError>>> | Total number of ERROR logs
*-------------------------------------+--------------------------------------+
|<<<LogWarn>>> | Total number of WARN logs
*-------------------------------------+--------------------------------------+
|<<<LogInfo>>> | Total number of INFO logs
*-------------------------------------+--------------------------------------+
|<<<GcNumWarnThresholdExceeded>>> | Number of times that the GC warn
| threshold is exceeded
*-------------------------------------+--------------------------------------+
|<<<GcNumInfoThresholdExceeded>>> | Number of times that the GC info
| threshold is exceeded
*-------------------------------------+--------------------------------------+
|<<<GcTotalExtraSleepTime>>> | Total GC extra sleep time in msec
*-------------------------------------+--------------------------------------+
rpc context
* rpc
Each metrics record contains tags such as Hostname
and port (number to which server is bound)
as additional information along with metrics.
*-------------------------------------+--------------------------------------+
|| Name || Description
*-------------------------------------+--------------------------------------+
|<<<ReceivedBytes>>> | Total number of received bytes
*-------------------------------------+--------------------------------------+
|<<<SentBytes>>> | Total number of sent bytes
*-------------------------------------+--------------------------------------+
|<<<RpcQueueTimeNumOps>>> | Total number of RPC calls
*-------------------------------------+--------------------------------------+
|<<<RpcQueueTimeAvgTime>>> | Average queue time in milliseconds
*-------------------------------------+--------------------------------------+
|<<<RpcProcessingTimeNumOps>>> | Total number of RPC calls (same as
| RpcQueueTimeNumOps)
*-------------------------------------+--------------------------------------+
|<<<RpcProcessingAvgTime>>> | Average Processing time in milliseconds
*-------------------------------------+--------------------------------------+
|<<<RpcAuthenticationFailures>>> | Total number of authentication failures
*-------------------------------------+--------------------------------------+
|<<<RpcAuthenticationSuccesses>>> | Total number of authentication successes
*-------------------------------------+--------------------------------------+
|<<<RpcAuthorizationFailures>>> | Total number of authorization failures
*-------------------------------------+--------------------------------------+
|<<<RpcAuthorizationSuccesses>>> | Total number of authorization successes
*-------------------------------------+--------------------------------------+
|<<<NumOpenConnections>>> | Current number of open connections
*-------------------------------------+--------------------------------------+
|<<<CallQueueLength>>> | Current length of the call queue
*-------------------------------------+--------------------------------------+
|<<<rpcQueueTime>>><num><<<sNumOps>>> | Shows total number of RPC calls
| | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
| | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
*-------------------------------------+--------------------------------------+
|<<<rpcQueueTime>>><num><<<s50thPercentileLatency>>> |
| | Shows the 50th percentile of RPC queue time in milliseconds
| | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
| | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
*-------------------------------------+--------------------------------------+
|<<<rpcQueueTime>>><num><<<s75thPercentileLatency>>> |
| | Shows the 75th percentile of RPC queue time in milliseconds
| | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
| | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
*-------------------------------------+--------------------------------------+
|<<<rpcQueueTime>>><num><<<s90thPercentileLatency>>> |
| | Shows the 90th percentile of RPC queue time in milliseconds
| | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
| | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
*-------------------------------------+--------------------------------------+
|<<<rpcQueueTime>>><num><<<s95thPercentileLatency>>> |
| | Shows the 95th percentile of RPC queue time in milliseconds
| | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
| | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
*-------------------------------------+--------------------------------------+
|<<<rpcQueueTime>>><num><<<s99thPercentileLatency>>> |
| | Shows the 99th percentile of RPC queue time in milliseconds
| | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
| | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
*-------------------------------------+--------------------------------------+
|<<<rpcProcessingTime>>><num><<<sNumOps>>> | Shows total number of RPC calls
| | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
| | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
*-------------------------------------+--------------------------------------+
|<<<rpcProcessingTime>>><num><<<s50thPercentileLatency>>> |
| | Shows the 50th percentile of RPC processing time in milliseconds
| | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
| | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
*-------------------------------------+--------------------------------------+
|<<<rpcProcessingTime>>><num><<<s75thPercentileLatency>>> |
| | Shows the 75th percentile of RPC processing time in milliseconds
| | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
| | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
*-------------------------------------+--------------------------------------+
|<<<rpcProcessingTime>>><num><<<s90thPercentileLatency>>> |
| | Shows the 90th percentile of RPC processing time in milliseconds
| | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
| | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
*-------------------------------------+--------------------------------------+
|<<<rpcProcessingTime>>><num><<<s95thPercentileLatency>>> |
| | Shows the 95th percentile of RPC processing time in milliseconds
| | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
| | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
*-------------------------------------+--------------------------------------+
|<<<rpcProcessingTime>>><num><<<s99thPercentileLatency>>> |
| | Shows the 99th percentile of RPC processing time in milliseconds
| | (<num> seconds granularity) if <<<rpc.metrics.quantile.enable>>> is set to
| | true. <num> is specified by <<<rpc.metrics.percentiles.intervals>>>.
*-------------------------------------+--------------------------------------+
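The quantile metrics above are published only when
<<<rpc.metrics.quantile.enable>>> is set to true; a quick way to see which of
them a server currently exposes is to filter the JMX servlet output (host,
port and interval are illustrative):
+---+
# with rpc.metrics.percentiles.intervals=60 the names carry a 60s infix,
# e.g. rpcQueueTime60s99thPercentileLatency
$ curl -s 'http://namenode.example.com:50070/jmx' | grep -i 'rpcQueueTime60s'
+---+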
* RetryCache/NameNodeRetryCache
RetryCache metrics are useful for monitoring NameNode fail-over.
Each metrics record contains a Hostname tag.
*-------------------------------------+--------------------------------------+
|| Name || Description
*-------------------------------------+--------------------------------------+
|<<<CacheHit>>> | Total number of RetryCache hit
*-------------------------------------+--------------------------------------+
|<<<CacheCleared>>> | Total number of RetryCache cleared
*-------------------------------------+--------------------------------------+
|<<<CacheUpdated>>> | Total number of RetryCache updated
*-------------------------------------+--------------------------------------+
rpcdetailed context
Metrics of the rpcdetailed context are exposed in a unified manner by the RPC
layer. Two metrics are exposed for each RPC based on its name.
The metric named "(RPC method name)NumOps" indicates the total number of
method calls, and the metric named "(RPC method name)AvgTime" shows the
average turnaround time for method calls in milliseconds.
* rpcdetailed
Each metrics record contains tags such as Hostname
and port (number to which server is bound)
as additional information along with metrics.
Metrics for RPC methods that have not been called are not included
in the metrics record.
*-------------------------------------+--------------------------------------+
|| Name || Description
*-------------------------------------+--------------------------------------+
|<methodname><<<NumOps>>> | Total number of the times the method is called
*-------------------------------------+--------------------------------------+
|<methodname><<<AvgTime>>> | Average turn around time of the method in
| milliseconds
*-------------------------------------+--------------------------------------+
dfs context
* namenode
Each metrics record contains tags such as ProcessName, SessionId,
and Hostname as additional information along with metrics.
*-------------------------------------+--------------------------------------+
|| Name || Description
*-------------------------------------+--------------------------------------+
|<<<CreateFileOps>>> | Total number of files created
*-------------------------------------+--------------------------------------+
|<<<FilesCreated>>> | Total number of files and directories created by create
| or mkdir operations
*-------------------------------------+--------------------------------------+
|<<<FilesAppended>>> | Total number of files appended
*-------------------------------------+--------------------------------------+
|<<<GetBlockLocations>>> | Total number of getBlockLocations operations
*-------------------------------------+--------------------------------------+
|<<<FilesRenamed>>> | Total number of rename <<operations>> (NOT number of
| files/dirs renamed)
*-------------------------------------+--------------------------------------+
|<<<GetListingOps>>> | Total number of directory listing operations
*-------------------------------------+--------------------------------------+
|<<<DeleteFileOps>>> | Total number of delete operations
*-------------------------------------+--------------------------------------+
|<<<FilesDeleted>>> | Total number of files and directories deleted by delete
| or rename operations
*-------------------------------------+--------------------------------------+
|<<<FileInfoOps>>> | Total number of getFileInfo and getLinkFileInfo
| operations
*-------------------------------------+--------------------------------------+
|<<<AddBlockOps>>> | Total number of addBlock operations succeeded
*-------------------------------------+--------------------------------------+
|<<<GetAdditionalDatanodeOps>>> | Total number of getAdditionalDatanode
| operations
*-------------------------------------+--------------------------------------+
|<<<CreateSymlinkOps>>> | Total number of createSymlink operations
*-------------------------------------+--------------------------------------+
|<<<GetLinkTargetOps>>> | Total number of getLinkTarget operations
*-------------------------------------+--------------------------------------+
|<<<FilesInGetListingOps>>> | Total number of files and directories listed by
| directory listing operations
*-------------------------------------+--------------------------------------+
|<<<AllowSnapshotOps>>> | Total number of allowSnapshot operations
*-------------------------------------+--------------------------------------+
|<<<DisallowSnapshotOps>>> | Total number of disallowSnapshot operations
*-------------------------------------+--------------------------------------+
|<<<CreateSnapshotOps>>> | Total number of createSnapshot operations
*-------------------------------------+--------------------------------------+
|<<<DeleteSnapshotOps>>> | Total number of deleteSnapshot operations
*-------------------------------------+--------------------------------------+
|<<<RenameSnapshotOps>>> | Total number of renameSnapshot operations
*-------------------------------------+--------------------------------------+
|<<<ListSnapshottableDirOps>>> | Total number of snapshottableDirectoryStatus
| operations
*-------------------------------------+--------------------------------------+
|<<<SnapshotDiffReportOps>>> | Total number of getSnapshotDiffReport
| operations
*-------------------------------------+--------------------------------------+
|<<<TransactionsNumOps>>> | Total number of Journal transactions
*-------------------------------------+--------------------------------------+
|<<<TransactionsAvgTime>>> | Average time of Journal transactions in
| milliseconds
*-------------------------------------+--------------------------------------+
|<<<SyncsNumOps>>> | Total number of Journal syncs
*-------------------------------------+--------------------------------------+
|<<<SyncsAvgTime>>> | Average time of Journal syncs in milliseconds
*-------------------------------------+--------------------------------------+
|<<<TransactionsBatchedInSync>>> | Total number of Journal transactions batched
| in sync
*-------------------------------------+--------------------------------------+
|<<<BlockReportNumOps>>> | Total number of processing block reports from
| DataNode
*-------------------------------------+--------------------------------------+
|<<<BlockReportAvgTime>>> | Average time of processing block reports in
| milliseconds
*-------------------------------------+--------------------------------------+
|<<<CacheReportNumOps>>> | Total number of processing cache reports from
| DataNode
*-------------------------------------+--------------------------------------+
|<<<CacheReportAvgTime>>> | Average time of processing cache reports in
| milliseconds
*-------------------------------------+--------------------------------------+
|<<<SafeModeTime>>> | The interval between FSNameSystem starts and the last
| time safemode leaves in milliseconds. \
| (sometimes not equal to the time in SafeMode,
| see {{{https://issues.apache.org/jira/browse/HDFS-5156}HDFS-5156}})
*-------------------------------------+--------------------------------------+
|<<<FsImageLoadTime>>> | Time loading FS Image at startup in milliseconds
*-------------------------------------+--------------------------------------+
|<<<GetEditNumOps>>> | Total number of edits downloads from SecondaryNameNode
*-------------------------------------+--------------------------------------+
|<<<GetEditAvgTime>>> | Average edits download time in milliseconds
*-------------------------------------+--------------------------------------+
|<<<GetImageNumOps>>> |Total number of fsimage downloads from SecondaryNameNode
*-------------------------------------+--------------------------------------+
|<<<GetImageAvgTime>>> | Average fsimage download time in milliseconds
*-------------------------------------+--------------------------------------+
|<<<PutImageNumOps>>> | Total number of fsimage uploads to SecondaryNameNode
*-------------------------------------+--------------------------------------+
|<<<PutImageAvgTime>>> | Average fsimage upload time in milliseconds
*-------------------------------------+--------------------------------------+
|<<<TotalFileOps>>> | Total number of file operations performed
*-------------------------------------+--------------------------------------+
* FSNamesystem
Each metrics record contains tags such as HAState and Hostname
as additional information along with metrics.
*-------------------------------------+--------------------------------------+
|| Name || Description
*-------------------------------------+--------------------------------------+
|<<<MissingBlocks>>> | Current number of missing blocks
*-------------------------------------+--------------------------------------+
|<<<ExpiredHeartbeats>>> | Total number of expired heartbeats
*-------------------------------------+--------------------------------------+
|<<<TransactionsSinceLastCheckpoint>>> | Total number of transactions since
| last checkpoint
*-------------------------------------+--------------------------------------+
|<<<TransactionsSinceLastLogRoll>>> | Total number of transactions since last
| edit log roll
*-------------------------------------+--------------------------------------+
|<<<LastWrittenTransactionId>>> | Last transaction ID written to the edit log
*-------------------------------------+--------------------------------------+
|<<<LastCheckpointTime>>> | Time in milliseconds since epoch of last checkpoint
*-------------------------------------+--------------------------------------+
|<<<CapacityTotal>>> | Current raw capacity of DataNodes in bytes
*-------------------------------------+--------------------------------------+
|<<<CapacityTotalGB>>> | Current raw capacity of DataNodes in GB
*-------------------------------------+--------------------------------------+
|<<<CapacityUsed>>> | Current used capacity across all DataNodes in bytes
*-------------------------------------+--------------------------------------+
|<<<CapacityUsedGB>>> | Current used capacity across all DataNodes in GB
*-------------------------------------+--------------------------------------+
|<<<CapacityRemaining>>> | Current remaining capacity in bytes
*-------------------------------------+--------------------------------------+
|<<<CapacityRemainingGB>>> | Current remaining capacity in GB
*-------------------------------------+--------------------------------------+
|<<<CapacityUsedNonDFS>>> | Current space used by DataNodes for non DFS
| purposes in bytes
*-------------------------------------+--------------------------------------+
|<<<TotalLoad>>> | Current number of connections
*-------------------------------------+--------------------------------------+
|<<<SnapshottableDirectories>>> | Current number of snapshottable directories
*-------------------------------------+--------------------------------------+
|<<<Snapshots>>> | Current number of snapshots
*-------------------------------------+--------------------------------------+
|<<<BlocksTotal>>> | Current number of allocated blocks in the system
*-------------------------------------+--------------------------------------+
|<<<FilesTotal>>> | Current number of files and directories
*-------------------------------------+--------------------------------------+
|<<<PendingReplicationBlocks>>> | Current number of blocks pending to be
| replicated
*-------------------------------------+--------------------------------------+
|<<<UnderReplicatedBlocks>>> | Current number of blocks under replicated
*-------------------------------------+--------------------------------------+
|<<<CorruptBlocks>>> | Current number of blocks with corrupt replicas.
*-------------------------------------+--------------------------------------+
|<<<ScheduledReplicationBlocks>>> | Current number of blocks scheduled for
| replications
*-------------------------------------+--------------------------------------+
|<<<PendingDeletionBlocks>>> | Current number of blocks pending deletion
*-------------------------------------+--------------------------------------+
|<<<ExcessBlocks>>> | Current number of excess blocks
*-------------------------------------+--------------------------------------+
|<<<PostponedMisreplicatedBlocks>>> | (HA-only) Current number of blocks
| postponed to replicate
*-------------------------------------+--------------------------------------+
|<<<PendingDataNodeMessageCount>>> | (HA-only) Current number of pending
| block-related messages for later
| processing in the standby NameNode
*-------------------------------------+--------------------------------------+
|<<<MillisSinceLastLoadedEdits>>> | (HA-only) Time in milliseconds since the
| last time the standby NameNode loaded the edit log.
| In the active NameNode, this is set to 0
*-------------------------------------+--------------------------------------+
|<<<BlockCapacity>>> | Current block capacity
*-------------------------------------+--------------------------------------+
|<<<StaleDataNodes>>> | Current number of DataNodes marked stale due to delayed
| heartbeat
*-------------------------------------+--------------------------------------+
|<<<TotalFiles>>> |Current number of files and directories (same as FilesTotal)
*-------------------------------------+--------------------------------------+
* JournalNode
The server-side metrics for a journal from the JournalNode's perspective.
Each metrics record contains Hostname tag as additional information
along with metrics.
*-------------------------------------+--------------------------------------+
|| Name || Description
*-------------------------------------+--------------------------------------+
|<<<Syncs60sNumOps>>> | Number of sync operations (1 minute granularity)
*-------------------------------------+--------------------------------------+
|<<<Syncs60s50thPercentileLatencyMicros>>> | The 50th percentile of sync
| | latency in microseconds (1 minute granularity)
*-------------------------------------+--------------------------------------+
|<<<Syncs60s75thPercentileLatencyMicros>>> | The 75th percentile of sync
| | latency in microseconds (1 minute granularity)
*-------------------------------------+--------------------------------------+
|<<<Syncs60s90thPercentileLatencyMicros>>> | The 90th percentile of sync
| | latency in microseconds (1 minute granularity)
*-------------------------------------+--------------------------------------+
|<<<Syncs60s95thPercentileLatencyMicros>>> | The 95th percentile of sync
| | latency in microseconds (1 minute granularity)
*-------------------------------------+--------------------------------------+
|<<<Syncs60s99thPercentileLatencyMicros>>> | The 99th percentile of sync
| | latency in microseconds (1 minute granularity)
*-------------------------------------+--------------------------------------+
|<<<Syncs300sNumOps>>> | Number of sync operations (5 minutes granularity)
*-------------------------------------+--------------------------------------+
|<<<Syncs300s50thPercentileLatencyMicros>>> | The 50th percentile of sync
| | latency in microseconds (5 minutes granularity)
*-------------------------------------+--------------------------------------+
|<<<Syncs300s75thPercentileLatencyMicros>>> | The 75th percentile of sync
| | latency in microseconds (5 minutes granularity)
*-------------------------------------+--------------------------------------+
|<<<Syncs300s90thPercentileLatencyMicros>>> | The 90th percentile of sync
| | latency in microseconds (5 minutes granularity)
*-------------------------------------+--------------------------------------+
|<<<Syncs300s95thPercentileLatencyMicros>>> | The 95th percentile of sync
| | latency in microseconds (5 minutes granularity)
*-------------------------------------+--------------------------------------+
|<<<Syncs300s99thPercentileLatencyMicros>>> | The 99th percentile of sync
| | latency in microseconds (5 minutes granularity)
*-------------------------------------+--------------------------------------+
|<<<Syncs3600sNumOps>>> | Number of sync operations (1 hour granularity)
*-------------------------------------+--------------------------------------+
|<<<Syncs3600s50thPercentileLatencyMicros>>> | The 50th percentile of sync
| | latency in microseconds (1 hour granularity)
*-------------------------------------+--------------------------------------+
|<<<Syncs3600s75thPercentileLatencyMicros>>> | The 75th percentile of sync
| | latency in microseconds (1 hour granularity)
*-------------------------------------+--------------------------------------+
|<<<Syncs3600s90thPercentileLatencyMicros>>> | The 90th percentile of sync
| | latency in microseconds (1 hour granularity)
*-------------------------------------+--------------------------------------+
|<<<Syncs3600s95thPercentileLatencyMicros>>> | The 95th percentile of sync
| | latency in microseconds (1 hour granularity)
*-------------------------------------+--------------------------------------+
|<<<Syncs3600s99thPercentileLatencyMicros>>> | The 99th percentile of sync
| | latency in microseconds (1 hour granularity)
*-------------------------------------+--------------------------------------+
|<<<BatchesWritten>>> | Total number of batches written since startup
*-------------------------------------+--------------------------------------+
|<<<TxnsWritten>>> | Total number of transactions written since startup
*-------------------------------------+--------------------------------------+
|<<<BytesWritten>>> | Total number of bytes written since startup
*-------------------------------------+--------------------------------------+
|<<<BatchesWrittenWhileLagging>>> | Total number of batches written where this
| | node was lagging
*-------------------------------------+--------------------------------------+
|<<<LastWriterEpoch>>> | Current writer's epoch number
*-------------------------------------+--------------------------------------+
|<<<CurrentLagTxns>>> | The number of transactions that this JournalNode is
| | lagging
*-------------------------------------+--------------------------------------+
|<<<LastWrittenTxId>>> | The highest transaction id stored on this JournalNode
*-------------------------------------+--------------------------------------+
|<<<LastPromisedEpoch>>> | The last epoch number which this node has promised
| | not to accept any lower epoch, or 0 if no promises have been made
*-------------------------------------+--------------------------------------+
* datanode
Each metrics record contains tags such as SessionId and Hostname
as additional information along with metrics.
*-------------------------------------+--------------------------------------+
|| Name || Description
*-------------------------------------+--------------------------------------+
|<<<BytesWritten>>> | Total number of bytes written to DataNode
*-------------------------------------+--------------------------------------+
|<<<BytesRead>>> | Total number of bytes read from DataNode
*-------------------------------------+--------------------------------------+
|<<<BlocksWritten>>> | Total number of blocks written to DataNode
*-------------------------------------+--------------------------------------+
|<<<BlocksRead>>> | Total number of blocks read from DataNode
*-------------------------------------+--------------------------------------+
|<<<BlocksReplicated>>> | Total number of blocks replicated
*-------------------------------------+--------------------------------------+
|<<<BlocksRemoved>>> | Total number of blocks removed
*-------------------------------------+--------------------------------------+
|<<<BlocksVerified>>> | Total number of blocks verified
*-------------------------------------+--------------------------------------+
|<<<BlockVerificationFailures>>> | Total number of verification failures
*-------------------------------------+--------------------------------------+
|<<<BlocksCached>>> | Total number of blocks cached
*-------------------------------------+--------------------------------------+
|<<<BlocksUncached>>> | Total number of blocks uncached
*-------------------------------------+--------------------------------------+
|<<<ReadsFromLocalClient>>> | Total number of read operations from local client
*-------------------------------------+--------------------------------------+
|<<<ReadsFromRemoteClient>>> | Total number of read operations from remote
| client
*-------------------------------------+--------------------------------------+
|<<<WritesFromLocalClient>>> | Total number of write operations from local
| client
*-------------------------------------+--------------------------------------+
|<<<WritesFromRemoteClient>>> | Total number of write operations from remote
| client
*-------------------------------------+--------------------------------------+
|<<<BlocksGetLocalPathInfo>>> | Total number of operations to get local path
| names of blocks
*-------------------------------------+--------------------------------------+
|<<<FsyncCount>>> | Total number of fsync
*-------------------------------------+--------------------------------------+
|<<<VolumeFailures>>> | Total number of volume failures that have occurred
*-------------------------------------+--------------------------------------+
|<<<ReadBlockOpNumOps>>> | Total number of read operations
*-------------------------------------+--------------------------------------+
|<<<ReadBlockOpAvgTime>>> | Average time of read operations in milliseconds
*-------------------------------------+--------------------------------------+
|<<<WriteBlockOpNumOps>>> | Total number of write operations
*-------------------------------------+--------------------------------------+
|<<<WriteBlockOpAvgTime>>> | Average time of write operations in milliseconds
*-------------------------------------+--------------------------------------+
|<<<BlockChecksumOpNumOps>>> | Total number of blockChecksum operations
*-------------------------------------+--------------------------------------+
|<<<BlockChecksumOpAvgTime>>> | Average time of blockChecksum operations in
| milliseconds
*-------------------------------------+--------------------------------------+
|<<<CopyBlockOpNumOps>>> | Total number of block copy operations
*-------------------------------------+--------------------------------------+
|<<<CopyBlockOpAvgTime>>> | Average time of block copy operations in
| milliseconds
*-------------------------------------+--------------------------------------+
|<<<ReplaceBlockOpNumOps>>> | Total number of block replace operations
*-------------------------------------+--------------------------------------+
|<<<ReplaceBlockOpAvgTime>>> | Average time of block replace operations in
| milliseconds
*-------------------------------------+--------------------------------------+
|<<<HeartbeatsNumOps>>> | Total number of heartbeats
*-------------------------------------+--------------------------------------+
|<<<HeartbeatsAvgTime>>> | Average heartbeat time in milliseconds
*-------------------------------------+--------------------------------------+
|<<<BlockReportsNumOps>>> | Total number of block report operations
*-------------------------------------+--------------------------------------+
|<<<BlockReportsAvgTime>>> | Average time of block report operations in
| milliseconds
*-------------------------------------+--------------------------------------+
|<<<CacheReportsNumOps>>> | Total number of cache report operations
*-------------------------------------+--------------------------------------+
|<<<CacheReportsAvgTime>>> | Average time of cache report operations in
| milliseconds
*-------------------------------------+--------------------------------------+
|<<<PacketAckRoundTripTimeNanosNumOps>>> | Total number of ack round trips
*-------------------------------------+--------------------------------------+
|<<<PacketAckRoundTripTimeNanosAvgTime>>> | Average time from ack send to
| | receive minus the downstream ack time in nanoseconds
*-------------------------------------+--------------------------------------+
|<<<FlushNanosNumOps>>> | Total number of flushes
*-------------------------------------+--------------------------------------+
|<<<FlushNanosAvgTime>>> | Average flush time in nanoseconds
*-------------------------------------+--------------------------------------+
|<<<FsyncNanosNumOps>>> | Total number of fsync operations
*-------------------------------------+--------------------------------------+
|<<<FsyncNanosAvgTime>>> | Average fsync time in nanoseconds
*-------------------------------------+--------------------------------------+
|<<<SendDataPacketBlockedOnNetworkNanosNumOps>>> | Total number of sent
| packets
*-------------------------------------+--------------------------------------+
|<<<SendDataPacketBlockedOnNetworkNanosAvgTime>>> | Average waiting time of
| | sending packets in nanoseconds
*-------------------------------------+--------------------------------------+
|<<<SendDataPacketTransferNanosNumOps>>> | Total number of sent packets
*-------------------------------------+--------------------------------------+
|<<<SendDataPacketTransferNanosAvgTime>>> | Average transfer time of sending
| packets in nanoseconds
*-------------------------------------+--------------------------------------+
|<<<TotalWriteTime>>> | Total number of milliseconds spent on write
| operations
*-------------------------------------+--------------------------------------+
|<<<TotalReadTime>>> | Total number of milliseconds spent on read
| operations
*-------------------------------------+--------------------------------------+
|<<<RemoteBytesRead>>> | Number of bytes read by remote clients
*-------------------------------------+--------------------------------------+
|<<<RemoteBytesWritten>>> | Number of bytes written by remote clients
*-------------------------------------+--------------------------------------+
yarn context
* ClusterMetrics
ClusterMetrics shows the metrics of the YARN cluster from the
ResourceManager's perspective. Each metrics record contains the
Hostname tag as additional information along with the metrics.
*-------------------------------------+--------------------------------------+
|| Name || Description
*-------------------------------------+--------------------------------------+
|<<<NumActiveNMs>>> | Current number of active NodeManagers
*-------------------------------------+--------------------------------------+
|<<<NumDecommissionedNMs>>> | Current number of decommissioned NodeManagers
*-------------------------------------+--------------------------------------+
|<<<NumLostNMs>>> | Current number of lost NodeManagers (not sending
| heartbeats)
*-------------------------------------+--------------------------------------+
|<<<NumUnhealthyNMs>>> | Current number of unhealthy NodeManagers
*-------------------------------------+--------------------------------------+
|<<<NumRebootedNMs>>> | Current number of rebooted NodeManagers
*-------------------------------------+--------------------------------------+
* QueueMetrics
QueueMetrics shows an application queue from the
ResourceManager's perspective. Each metrics record shows
the statistics of each queue, and contains tags such as
queue name and Hostname as additional information along with metrics.
For the <<<running_>>><num> metrics such as <<<running_0>>>, you can set the
property <<<yarn.resourcemanager.metrics.runtime.buckets>>> in yarn-site.xml
to change the buckets. The default value is <<<60,300,1440>>>.
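For example, a yarn-site.xml entry like the following (the bucket values here are purely illustrative) would change the buckets to 30, 60 and 90 minutes:
----
<property>
  <!-- Illustrative values; the default is 60,300,1440 (minutes) -->
  <name>yarn.resourcemanager.metrics.runtime.buckets</name>
  <value>30,60,90</value>
</property>
----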
*-------------------------------------+--------------------------------------+
|| Name || Description
*-------------------------------------+--------------------------------------+
|<<<running_0>>> | Current number of running applications whose elapsed time is
| less than 60 minutes
*-------------------------------------+--------------------------------------+
|<<<running_60>>> | Current number of running applications whose elapsed time is
| between 60 and 300 minutes
*-------------------------------------+--------------------------------------+
|<<<running_300>>> | Current number of running applications whose elapsed time is
| between 300 and 1440 minutes
*-------------------------------------+--------------------------------------+
|<<<running_1440>>> | Current number of running applications whose elapsed time is
| more than 1440 minutes
*-------------------------------------+--------------------------------------+
|<<<AppsSubmitted>>> | Total number of submitted applications
*-------------------------------------+--------------------------------------+
|<<<AppsRunning>>> | Current number of running applications
*-------------------------------------+--------------------------------------+
|<<<AppsPending>>> | Current number of applications that have not yet been
| assigned any containers
*-------------------------------------+--------------------------------------+
|<<<AppsCompleted>>> | Total number of completed applications
*-------------------------------------+--------------------------------------+
|<<<AppsKilled>>> | Total number of killed applications
*-------------------------------------+--------------------------------------+
|<<<AppsFailed>>> | Total number of failed applications
*-------------------------------------+--------------------------------------+
|<<<AllocatedMB>>> | Current allocated memory in MB
*-------------------------------------+--------------------------------------+
|<<<AllocatedVCores>>> | Current allocated CPU in virtual cores
*-------------------------------------+--------------------------------------+
|<<<AllocatedContainers>>> | Current number of allocated containers
*-------------------------------------+--------------------------------------+
|<<<AggregateContainersAllocated>>> | Total number of allocated containers
*-------------------------------------+--------------------------------------+
|<<<AggregateContainersReleased>>> | Total number of released containers
*-------------------------------------+--------------------------------------+
|<<<AvailableMB>>> | Current available memory in MB
*-------------------------------------+--------------------------------------+
|<<<AvailableVCores>>> | Current available CPU in virtual cores
*-------------------------------------+--------------------------------------+
|<<<PendingMB>>> | Current pending memory resource requests in MB that are
| not yet fulfilled by the scheduler
*-------------------------------------+--------------------------------------+
|<<<PendingVCores>>> | Current pending CPU allocation requests in virtual
| cores that are not yet fulfilled by the scheduler
*-------------------------------------+--------------------------------------+
|<<<PendingContainers>>> | Current pending resource requests that are not
| yet fulfilled by the scheduler
*-------------------------------------+--------------------------------------+
|<<<ReservedMB>>> | Current reserved memory in MB
*-------------------------------------+--------------------------------------+
|<<<ReservedVCores>>> | Current reserved CPU in virtual cores
*-------------------------------------+--------------------------------------+
|<<<ReservedContainers>>> | Current number of reserved containers
*-------------------------------------+--------------------------------------+
|<<<ActiveUsers>>> | Current number of active users
*-------------------------------------+--------------------------------------+
|<<<ActiveApplications>>> | Current number of active applications
*-------------------------------------+--------------------------------------+
|<<<FairShareMB>>> | (FairScheduler only) Current fair share of memory in MB
*-------------------------------------+--------------------------------------+
|<<<FairShareVCores>>> | (FairScheduler only) Current fair share of CPU in
| virtual cores
*-------------------------------------+--------------------------------------+
|<<<MinShareMB>>> | (FairScheduler only) Minimum share of memory in MB
*-------------------------------------+--------------------------------------+
|<<<MinShareVCores>>> | (FairScheduler only) Minimum share of CPU in virtual
| cores
*-------------------------------------+--------------------------------------+
|<<<MaxShareMB>>> | (FairScheduler only) Maximum share of memory in MB
*-------------------------------------+--------------------------------------+
|<<<MaxShareVCores>>> | (FairScheduler only) Maximum share of CPU in virtual
| cores
*-------------------------------------+--------------------------------------+
* NodeManagerMetrics
NodeManagerMetrics shows the statistics of the containers in the node.
Each metrics record contains the Hostname tag as additional information
along with the metrics.
*-------------------------------------+--------------------------------------+
|| Name || Description
*-------------------------------------+--------------------------------------+
|<<<containersLaunched>>> | Total number of launched containers
*-------------------------------------+--------------------------------------+
|<<<containersCompleted>>> | Total number of successfully completed containers
*-------------------------------------+--------------------------------------+
|<<<containersFailed>>> | Total number of failed containers
*-------------------------------------+--------------------------------------+
|<<<containersKilled>>> | Total number of killed containers
*-------------------------------------+--------------------------------------+
|<<<containersIniting>>> | Current number of initializing containers
*-------------------------------------+--------------------------------------+
|<<<containersRunning>>> | Current number of running containers
*-------------------------------------+--------------------------------------+
|<<<allocatedContainers>>> | Current number of allocated containers
*-------------------------------------+--------------------------------------+
|<<<allocatedGB>>> | Current allocated memory in GB
*-------------------------------------+--------------------------------------+
|<<<availableGB>>> | Current available memory in GB
*-------------------------------------+--------------------------------------+
ugi context
* UgiMetrics
UgiMetrics is related to user and group information.
Each metrics record contains the Hostname tag as additional information
along with the metrics.
*-------------------------------------+--------------------------------------+
|| Name || Description
*-------------------------------------+--------------------------------------+
|<<<LoginSuccessNumOps>>> | Total number of successful kerberos logins
*-------------------------------------+--------------------------------------+
|<<<LoginSuccessAvgTime>>> | Average time for successful kerberos logins in
| milliseconds
*-------------------------------------+--------------------------------------+
|<<<LoginFailureNumOps>>> | Total number of failed kerberos logins
*-------------------------------------+--------------------------------------+
|<<<LoginFailureAvgTime>>> | Average time for failed kerberos logins in
| milliseconds
*-------------------------------------+--------------------------------------+
|<<<getGroupsNumOps>>> | Total number of group resolutions
*-------------------------------------+--------------------------------------+
|<<<getGroupsAvgTime>>> | Average time for group resolution in milliseconds
*-------------------------------------+--------------------------------------+
|<<<getGroups>>><num><<<sNumOps>>> |
| | Total number of group resolutions (<num> seconds granularity). <num> is
| | specified by <<<hadoop.user.group.metrics.percentiles.intervals>>>.
*-------------------------------------+--------------------------------------+
|<<<getGroups>>><num><<<s50thPercentileLatency>>> |
| | Shows the 50th percentile of group resolution time in milliseconds
| | (<num> seconds granularity). <num> is specified by
| | <<<hadoop.user.group.metrics.percentiles.intervals>>>.
*-------------------------------------+--------------------------------------+
|<<<getGroups>>><num><<<s75thPercentileLatency>>> |
| | Shows the 75th percentile of group resolution time in milliseconds
| | (<num> seconds granularity). <num> is specified by
| | <<<hadoop.user.group.metrics.percentiles.intervals>>>.
*-------------------------------------+--------------------------------------+
|<<<getGroups>>><num><<<s90thPercentileLatency>>> |
| | Shows the 90th percentile of group resolution time in milliseconds
| | (<num> seconds granularity). <num> is specified by
| | <<<hadoop.user.group.metrics.percentiles.intervals>>>.
*-------------------------------------+--------------------------------------+
|<<<getGroups>>><num><<<s95thPercentileLatency>>> |
| | Shows the 95th percentile of group resolution time in milliseconds
| | (<num> seconds granularity). <num> is specified by
| | <<<hadoop.user.group.metrics.percentiles.intervals>>>.
*-------------------------------------+--------------------------------------+
|<<<getGroups>>><num><<<s99thPercentileLatency>>> |
| | Shows the 99th percentile of group resolution time in milliseconds
| | (<num> seconds granularity). <num> is specified by
| | <<<hadoop.user.group.metrics.percentiles.intervals>>>.
*-------------------------------------+--------------------------------------+
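As a sketch, these per-interval percentile metrics appear only when the interval property is set in core-site.xml; with the illustrative value below (60 seconds) they would show up as <<<getGroups60sNumOps>>>, <<<getGroups60s50thPercentileLatency>>>, and so on:
----
<property>
  <!-- Illustrative 60-second interval for the getGroups<num>s* metrics -->
  <name>hadoop.user.group.metrics.percentiles.intervals</name>
  <value>60</value>
</property>
----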
metricssystem context
* MetricsSystem
MetricsSystem shows the statistics for metrics snapshots and publishing.
Each metrics record contains the Hostname tag as additional information
along with the metrics.
*-------------------------------------+--------------------------------------+
|| Name || Description
*-------------------------------------+--------------------------------------+
|<<<NumActiveSources>>> | Current number of active metrics sources
*-------------------------------------+--------------------------------------+
|<<<NumAllSources>>> | Total number of metrics sources
*-------------------------------------+--------------------------------------+
|<<<NumActiveSinks>>> | Current number of active sinks
*-------------------------------------+--------------------------------------+
|<<<NumAllSinks>>> | Total number of sinks \
| (BUT usually less than <<<NumActiveSinks>>>,
| see {{{https://issues.apache.org/jira/browse/HADOOP-9946}HADOOP-9946}})
*-------------------------------------+--------------------------------------+
|<<<SnapshotNumOps>>> | Total number of operations to snapshot statistics from
| a metrics source
*-------------------------------------+--------------------------------------+
|<<<SnapshotAvgTime>>> | Average time in milliseconds to snapshot statistics
| from a metrics source
*-------------------------------------+--------------------------------------+
|<<<PublishNumOps>>> | Total number of operations to publish statistics to a
| sink
*-------------------------------------+--------------------------------------+
|<<<PublishAvgTime>>> | Average time in milliseconds to publish statistics to
| a sink
*-------------------------------------+--------------------------------------+
|<<<DroppedPubAll>>> | Total number of dropped publishes
*-------------------------------------+--------------------------------------+
|<<<Sink_>>><instance><<<NumOps>>> | Total number of sink operations for the
| <instance>
*-------------------------------------+--------------------------------------+
|<<<Sink_>>><instance><<<AvgTime>>> | Average time in milliseconds of sink
| operations for the <instance>
*-------------------------------------+--------------------------------------+
|<<<Sink_>>><instance><<<Dropped>>> | Total number of dropped sink operations
| for the <instance>
*-------------------------------------+--------------------------------------+
|<<<Sink_>>><instance><<<Qsize>>> | Current queue length of the sink
*-------------------------------------+--------------------------------------+
default context
* StartupProgress
StartupProgress metrics show the statistics of NameNode startup.
Four metrics are exposed for each startup phase based on its name.
The startup <phase>s are <<<LoadingFsImage>>>, <<<LoadingEdits>>>,
<<<SavingCheckpoint>>>, and <<<SafeMode>>>.
Each metrics record contains the Hostname tag as additional information
along with the metrics.
*-------------------------------------+--------------------------------------+
|| Name || Description
*-------------------------------------+--------------------------------------+
|<<<ElapsedTime>>> | Total elapsed time in milliseconds
*-------------------------------------+--------------------------------------+
|<<<PercentComplete>>> | Current rate completed in NameNode startup progress \
| (The max value is not 100 but 1.0)
*-------------------------------------+--------------------------------------+
|<phase><<<Count>>> | Total number of steps completed in the phase
*-------------------------------------+--------------------------------------+
|<phase><<<ElapsedTime>>> | Total elapsed time in the phase in milliseconds
*-------------------------------------+--------------------------------------+
|<phase><<<Total>>> | Total number of steps in the phase
*-------------------------------------+--------------------------------------+
|<phase><<<PercentComplete>>> | Current rate completed in the phase \
| (The max value is not 100 but 1.0)
*-------------------------------------+--------------------------------------+

View File

@ -1,205 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Native Libraries Guide
---
---
${maven.build.timestamp}
Native Libraries Guide
%{toc|section=1|fromDepth=0}
* Overview
This guide describes the native hadoop library and includes a small
discussion about native shared libraries.
Note: Depending on your environment, the term "native libraries" could
refer to all *.so's you need to compile, and the term "native
compression" could refer to all *.so's you need to compile that are
specifically related to compression. Currently, however, this document
only addresses the native hadoop library (<<<libhadoop.so>>>).
The documentation for the libhdfs library (<<<libhdfs.so>>>) is
{{{../hadoop-hdfs/LibHdfs.html}here}}.
* Native Hadoop Library
Hadoop has native implementations of certain components for performance
reasons and for non-availability of Java implementations. These
components are available in a single, dynamically-linked native library
called the native hadoop library. On the *nix platforms the library is
named <<<libhadoop.so>>>.
* Usage
It is fairly easy to use the native hadoop library:
[[1]] Review the components.
[[2]] Review the supported platforms.
[[3]] Either download a hadoop release, which will include a pre-built
version of the native hadoop library, or build your own version of
the native hadoop library. Whether you download or build, the name
for the library is the same: libhadoop.so
[[4]] Install the compression codec development packages (>zlib-1.2,
>gzip-1.2):
* If you download the library, install one or more development
packages - whichever compression codecs you want to use with
your deployment.
* If you build the library, it is mandatory to install both
development packages.
[[5]] Check the runtime log files.
* Components
The native hadoop library includes various components:
* Compression Codecs (bzip2, lz4, snappy, zlib)
* Native IO utilities for {{{../hadoop-hdfs/ShortCircuitLocalReads.html}
HDFS Short-Circuit Local Reads}} and
{{{../hadoop-hdfs/CentralizedCacheManagement.html}Centralized Cache
Management in HDFS}}
* CRC32 checksum implementation
* Supported Platforms
The native hadoop library is supported on *nix platforms only. The
library does not work with Cygwin or the Mac OS X platform.
The native hadoop library is mainly used on the GNU/Linux platform and
has been tested on these distributions:
* RHEL4/Fedora
* Ubuntu
* Gentoo
On all the above distributions a 32/64 bit native hadoop library will
work with a respective 32/64 bit jvm.
* Download
The pre-built 32-bit i386-Linux native hadoop library is available as
part of the hadoop distribution and is located in the <<<lib/native>>>
directory. You can download the hadoop distribution from Hadoop Common
Releases.
Be sure to install the zlib and/or gzip development packages -
whichever compression codecs you want to use with your deployment.
* Build
The native hadoop library is written in ANSI C and is built using the
GNU autotools-chain (autoconf, autoheader, automake, autoscan,
libtool). This means it should be straight-forward to build the library
on any platform with a standards-compliant C compiler and the GNU
autotools-chain (see the supported platforms).
The packages you need to install on the target platform are:
* C compiler (e.g. GNU C Compiler)
* GNU Autotools chain: autoconf, automake, libtool
* zlib-development package (stable version >= 1.2.0)
* openssl development package (e.g. libssl-dev)
Once you have installed the prerequisite packages, use the standard hadoop
pom.xml file and pass along the native flag to build the native hadoop
library:
----
$ mvn package -Pdist,native -DskipTests -Dtar
----
You should see the newly-built library in:
----
$ hadoop-dist/target/hadoop-${project.version}/lib/native
----
Please note the following:
* It is mandatory to install both the zlib and gzip development
packages on the target platform in order to build the native hadoop
library; however, for deployment it is sufficient to install just
one package if you wish to use only one codec.
* It is necessary to have the correct 32/64 libraries for zlib,
depending on the 32/64 bit jvm for the target platform, in order to
build and deploy the native hadoop library.
* Runtime
The bin/hadoop script ensures that the native hadoop library is on the
library path via the system property:
<<<-Djava.library.path=<path> >>>
During runtime, check the hadoop log files for your MapReduce tasks.
* If everything is all right, then:
<<<DEBUG util.NativeCodeLoader - Trying to load the custom-built native-hadoop library...>>>
<<<INFO util.NativeCodeLoader - Loaded the native-hadoop library>>>
* If something goes wrong, then:
<<<INFO util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable>>>
* Check
NativeLibraryChecker is a tool to check whether native libraries are loaded correctly.
You can launch NativeLibraryChecker as follows:
----
$ hadoop checknative -a
14/12/06 01:30:45 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version
14/12/06 01:30:45 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /home/ozawa/hadoop/lib/native/libhadoop.so.1.0.0
zlib: true /lib/x86_64-linux-gnu/libz.so.1
snappy: true /usr/lib/libsnappy.so.1
lz4: true revision:99
bzip2: false
----
* Native Shared Libraries
You can load any native shared library using DistributedCache for
distributing and symlinking the library files.
This example shows you how to distribute a shared library, mylib.so,
and load it from a MapReduce task.
[[1]] First copy the library to the HDFS:
<<<bin/hadoop fs -copyFromLocal mylib.so.1 /libraries/mylib.so.1>>>
[[2]] The job launching program should contain the following:
<<<DistributedCache.createSymlink(conf);>>>
<<<DistributedCache.addCacheFile("hdfs://host:port/libraries/mylib.so.1#mylib.so", conf);>>>
[[3]] The MapReduce task can contain:
<<<System.loadLibrary("mylib.so");>>>
Note: If you downloaded or built the native hadoop library, you don't
need to use DistributedCache to make the library available to your
MapReduce tasks.

View File

@ -1,689 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop in Secure Mode
---
---
${maven.build.timestamp}
%{toc|section=0|fromDepth=0|toDepth=3}
Hadoop in Secure Mode
* Introduction
This document describes how to configure authentication for Hadoop in
secure mode.
By default Hadoop runs in non-secure mode in which no actual
authentication is required.
By configuring Hadoop to run in secure mode,
each user and service needs to be authenticated by Kerberos
in order to use Hadoop services.
Security features of Hadoop consist of
{{{Authentication}authentication}},
{{{./ServiceLevelAuth.html}service level authorization}},
{{{./HttpAuthentication.html}authentication for Web consoles}}
and {{{Data confidentiality}data confidentiality}}.
* Authentication
** End User Accounts
When service level authentication is turned on,
end users using Hadoop in secure mode need to be authenticated by Kerberos.
The simplest way to authenticate is to use the <<<kinit>>> command of Kerberos.
** User Accounts for Hadoop Daemons
Ensure that HDFS and YARN daemons run as different Unix users,
e.g. <<<hdfs>>> and <<<yarn>>>.
Also, ensure that the MapReduce JobHistory server runs as a
different user, such as <<<mapred>>>.
It's recommended to have them share a Unix group, e.g. <<<hadoop>>>.
See also "{{Mapping from user to group}}" for group management.
*---------------+----------------------------------------------------------------------+
|| User:Group || Daemons |
*---------------+----------------------------------------------------------------------+
| hdfs:hadoop | NameNode, Secondary NameNode, JournalNode, DataNode |
*---------------+----------------------------------------------------------------------+
| yarn:hadoop | ResourceManager, NodeManager |
*---------------+----------------------------------------------------------------------+
| mapred:hadoop | MapReduce JobHistory Server |
*---------------+----------------------------------------------------------------------+
** Kerberos principals for Hadoop Daemons and Users
Kerberos principals are required for running Hadoop service daemons
in secure mode.
Each service reads authentication information saved in a keytab file with appropriate permissions.
HTTP web-consoles should be served by a principal different from the RPC one.
The subsections below show examples of credentials for Hadoop services.
*** HDFS
The NameNode keytab file, on the NameNode host, should look like the
following:
----
$ klist -e -k -t /etc/security/keytab/nn.service.keytab
Keytab name: FILE:/etc/security/keytab/nn.service.keytab
KVNO Timestamp Principal
4 07/18/11 21:08:09 nn/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 nn/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 nn/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
----
The Secondary NameNode keytab file, on that host, should look like the
following:
----
$ klist -e -k -t /etc/security/keytab/sn.service.keytab
Keytab name: FILE:/etc/security/keytab/sn.service.keytab
KVNO Timestamp Principal
4 07/18/11 21:08:09 sn/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 sn/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 sn/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
----
The DataNode keytab file, on each host, should look like the following:
----
$ klist -e -k -t /etc/security/keytab/dn.service.keytab
Keytab name: FILE:/etc/security/keytab/dn.service.keytab
KVNO Timestamp Principal
4 07/18/11 21:08:09 dn/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 dn/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 dn/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
----
*** YARN
The ResourceManager keytab file, on the ResourceManager host, should look
like the following:
----
$ klist -e -k -t /etc/security/keytab/rm.service.keytab
Keytab name: FILE:/etc/security/keytab/rm.service.keytab
KVNO Timestamp Principal
4 07/18/11 21:08:09 rm/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 rm/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 rm/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
----
The NodeManager keytab file, on each host, should look like the following:
----
$ klist -e -k -t /etc/security/keytab/nm.service.keytab
Keytab name: FILE:/etc/security/keytab/nm.service.keytab
KVNO Timestamp Principal
4 07/18/11 21:08:09 nm/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 nm/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 nm/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
----
*** MapReduce JobHistory Server
The MapReduce JobHistory Server keytab file, on that host, should look
like the following:
----
$ klist -e -k -t /etc/security/keytab/jhs.service.keytab
Keytab name: FILE:/etc/security/keytab/jhs.service.keytab
KVNO Timestamp Principal
4 07/18/11 21:08:09 jhs/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 jhs/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 jhs/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
----
** Mapping from Kerberos principal to OS user account
Hadoop maps a Kerberos principal to an OS user account using
the rule specified by <<<hadoop.security.auth_to_local>>>,
which works in the same way as the <<<auth_to_local>>> in
{{{http://web.mit.edu/Kerberos/krb5-latest/doc/admin/conf_files/krb5_conf.html}Kerberos configuration file (krb5.conf)}}.
In addition, Hadoop <<<auth_to_local>>> mapping supports the <</L>> flag that
lowercases the returned name.
By default, it picks the first component of the principal name as the user name
if the realm matches the <<<default_realm>>> (usually defined in /etc/krb5.conf).
For example, <<<host/full.qualified.domain.name@REALM.TLD>>> is mapped to <<<host>>>
by the default rule.
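As an illustrative sketch only (the realm <<<REALM.TLD>>> and the service principals match the examples earlier in this document), a core-site.xml entry that maps the HDFS, YARN and JobHistory principals to the <<<hdfs>>>, <<<yarn>>> and <<<mapred>>> accounts could look like:
----
<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[2:$1/$2@$0]([ndj]n/.*@REALM\.TLD)s/.*/hdfs/
    RULE:[2:$1/$2@$0]([rn]m/.*@REALM\.TLD)s/.*/yarn/
    RULE:[2:$1/$2@$0](jhs/.*@REALM\.TLD)s/.*/mapred/
    DEFAULT
  </value>
</property>
----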
** Mapping from user to group
Though files on HDFS are associated with an owner and a group,
Hadoop itself does not define groups.
Mapping from users to groups is done by the OS or LDAP.
You can change the mapping by specifying the name of the mapping provider
as the value of <<<hadoop.security.group.mapping>>>.
See {{{../hadoop-hdfs/HdfsPermissionsGuide.html}HDFS Permissions Guide}} for details.
In practice, you need to manage an SSO environment using Kerberos with LDAP
for Hadoop in secure mode.
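For example (a sketch only; <<<LdapGroupsMapping>>> is one of the providers shipped with Hadoop), switching from the default shell-based mapping to LDAP-based group mapping could look like:
----
<property>
  <!-- Sketch: use LDAP for group resolution instead of the OS -->
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>
----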
** Proxy user
Some products such as Apache Oozie which access the services of Hadoop
on behalf of end users need to be able to impersonate end users.
See {{{./Superusers.html}the proxy user documentation}} for details.
** Secure DataNode
Because the data transfer protocol of the DataNode
does not use the RPC framework of Hadoop,
the DataNode must authenticate itself by
using privileged ports, which are specified by
<<<dfs.datanode.address>>> and <<<dfs.datanode.http.address>>>.
This authentication is based on the assumption
that the attacker won't be able to get root privileges.
When you execute the <<<hdfs datanode>>> command as root,
the server process binds the privileged port first,
then drops privileges and runs as the user account specified by
<<<HADOOP_SECURE_DN_USER>>>.
This startup process uses jsvc installed in <<<JSVC_HOME>>>.
You must specify <<<HADOOP_SECURE_DN_USER>>> and <<<JSVC_HOME>>>
as environment variables on startup (in hadoop-env.sh).
As of version 2.6.0, SASL can be used to authenticate the data transfer
protocol. In this configuration, it is no longer required for secured clusters
to start the DataNode as root using jsvc and bind to privileged ports. To
enable SASL on data transfer protocol, set <<<dfs.data.transfer.protection>>>
in hdfs-site.xml, set a non-privileged port for <<<dfs.datanode.address>>>, set
<<<dfs.http.policy>>> to <HTTPS_ONLY> and make sure the
<<<HADOOP_SECURE_DN_USER>>> environment variable is not defined. Note that it
is not possible to use SASL on data transfer protocol if
<<<dfs.datanode.address>>> is set to a privileged port. This is required for
backwards-compatibility reasons.
In order to migrate an existing cluster that used root authentication to start
using SASL instead, first ensure that version 2.6.0 or later has been deployed
to all cluster nodes as well as any external applications that need to connect
to the cluster. Only versions 2.6.0 and later of the HDFS client can connect
to a DataNode that uses SASL for authentication of data transfer protocol, so
it is vital that all callers have the correct version before migrating. After
version 2.6.0 or later has been deployed everywhere, update configuration of
any external applications to enable SASL. If an HDFS client is enabled for
SASL, then it can connect successfully to a DataNode running with either root
authentication or SASL authentication. Changing configuration for all clients
guarantees that subsequent configuration changes on DataNodes will not disrupt
the applications. Finally, each individual DataNode can be migrated by
changing its configuration and restarting. It is acceptable to have a mix of
some DataNodes running with root authentication and some DataNodes running with
SASL authentication temporarily during this migration period, because an HDFS
client enabled for SASL can connect to both.
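As a sketch of the SASL-based setup described above (the DataNode port below is an arbitrary non-privileged example), hdfs-site.xml would contain entries such as the following, with <<<HADOOP_SECURE_DN_USER>>> left undefined:
----
<property>
  <name>dfs.data.transfer.protection</name>
  <value>authentication</value>
</property>
<property>
  <!-- Must be a non-privileged port when SASL is used -->
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:10019</value>
</property>
<property>
  <name>dfs.http.policy</name>
  <value>HTTPS_ONLY</value>
</property>
----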
* Data confidentiality
** Data Encryption on RPC
The data transferred between Hadoop services and clients can be
encrypted on the wire. Setting <<<hadoop.rpc.protection>>> to <<<"privacy">>>
in core-site.xml activates data encryption.
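A minimal core-site.xml sketch of this setting:
----
<property>
  <!-- "privacy" adds encryption on top of authentication and integrity -->
  <name>hadoop.rpc.protection</name>
  <value>privacy</value>
</property>
----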
** Data Encryption on Block data transfer.
You need to set <<<dfs.encrypt.data.transfer>>> to <<<"true">>> in hdfs-site.xml
in order to activate data encryption for the data transfer protocol of the DataNode.
Optionally, you may set <<<dfs.encrypt.data.transfer.algorithm>>> to either
"3des" or "rc4" to choose the specific encryption algorithm. If unspecified,
then the configured JCE default on the system is used, which is usually 3DES.
Setting <<<dfs.encrypt.data.transfer.cipher.suites>>> to
<<<AES/CTR/NoPadding>>> activates AES encryption. By default, this is
unspecified, so AES is not used. When AES is used, the algorithm specified in
<<<dfs.encrypt.data.transfer.algorithm>>> is still used during an initial key
exchange. The AES key bit length can be configured by setting
<<<dfs.encrypt.data.transfer.cipher.key.bitlength>>> to 128, 192 or 256. The
default is 128.
AES offers the greatest cryptographic strength and the best performance. At
this time, 3DES and RC4 have been used more often in Hadoop clusters.
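Putting the options above together, an hdfs-site.xml sketch that enables AES-encrypted block data transfer (the key length of 256 is just one of the allowed values) might look like:
----
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
<property>
  <name>dfs.encrypt.data.transfer.cipher.suites</name>
  <value>AES/CTR/NoPadding</value>
</property>
<property>
  <!-- 128 (default), 192 or 256 -->
  <name>dfs.encrypt.data.transfer.cipher.key.bitlength</name>
  <value>256</value>
</property>
----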
** Data Encryption on HTTP
Data transfer between the web console and clients is protected by using SSL (HTTPS).
* Configuration
** Permissions for both HDFS and local filesystem paths
The following table lists various paths on HDFS and local filesystems (on
all nodes) and recommended permissions:
*-------------------+-------------------+------------------+------------------+
|| Filesystem || Path || User:Group || Permissions |
*-------------------+-------------------+------------------+------------------+
| local | <<<dfs.namenode.name.dir>>> | hdfs:hadoop | drwx------ |
*-------------------+-------------------+------------------+------------------+
| local | <<<dfs.datanode.data.dir>>> | hdfs:hadoop | drwx------ |
*-------------------+-------------------+------------------+------------------+
| local | $HADOOP_LOG_DIR | hdfs:hadoop | drwxrwxr-x |
*-------------------+-------------------+------------------+------------------+
| local | $YARN_LOG_DIR | yarn:hadoop | drwxrwxr-x |
*-------------------+-------------------+------------------+------------------+
| local | <<<yarn.nodemanager.local-dirs>>> | yarn:hadoop | drwxr-xr-x |
*-------------------+-------------------+------------------+------------------+
| local | <<<yarn.nodemanager.log-dirs>>> | yarn:hadoop | drwxr-xr-x |
*-------------------+-------------------+------------------+------------------+
| local | container-executor | root:hadoop | --Sr-s--- |
*-------------------+-------------------+------------------+------------------+
| local | <<<conf/container-executor.cfg>>> | root:hadoop | r-------- |
*-------------------+-------------------+------------------+------------------+
| hdfs | / | hdfs:hadoop | drwxr-xr-x |
*-------------------+-------------------+------------------+------------------+
| hdfs | /tmp | hdfs:hadoop | drwxrwxrwxt |
*-------------------+-------------------+------------------+------------------+
| hdfs | /user | hdfs:hadoop | drwxr-xr-x |
*-------------------+-------------------+------------------+------------------+
| hdfs | <<<yarn.nodemanager.remote-app-log-dir>>> | yarn:hadoop | drwxrwxrwxt |
*-------------------+-------------------+------------------+------------------+
| hdfs | <<<mapreduce.jobhistory.intermediate-done-dir>>> | mapred:hadoop | |
| | | | drwxrwxrwxt |
*-------------------+-------------------+------------------+------------------+
| hdfs | <<<mapreduce.jobhistory.done-dir>>> | mapred:hadoop | |
| | | | drwxr-x--- |
*-------------------+-------------------+------------------+------------------+
** Common Configurations
In order to turn on RPC authentication in Hadoop,
set the value of the <<<hadoop.security.authentication>>> property to
<<<"kerberos">>>, and set the security-related settings listed below appropriately.
The following properties should be in the <<<core-site.xml>>> of all the
nodes in the cluster.
*-------------------------+-------------------------+------------------------+
|| Parameter || Value || Notes |
*-------------------------+-------------------------+------------------------+
| <<<hadoop.security.authentication>>> | <kerberos> | |
| | | <<<simple>>> : No authentication. (default) \
| | | <<<kerberos>>> : Enable authentication by Kerberos. |
*-------------------------+-------------------------+------------------------+
| <<<hadoop.security.authorization>>> | <true> | |
| | | Enable {{{./ServiceLevelAuth.html}RPC service-level authorization}}. |
*-------------------------+-------------------------+------------------------+
| <<<hadoop.rpc.protection>>> | <authentication> |
| | | <authentication> : authentication only (default) \
| | | <integrity> : integrity check in addition to authentication \
| | | <privacy> : data encryption in addition to integrity |
*-------------------------+-------------------------+------------------------+
| <<<hadoop.security.auth_to_local>>> | | |
| | <<<RULE:>>><exp1>\
| | <<<RULE:>>><exp2>\
| | <...>\
| | DEFAULT |
| | | The value is a string containing new line characters.
| | | See
| | | {{{http://web.mit.edu/Kerberos/krb5-latest/doc/admin/conf_files/krb5_conf.html}Kerberos documentation}}
| | | for format for <exp>.
*-------------------------+-------------------------+------------------------+
| <<<hadoop.proxyuser.>>><superuser><<<.hosts>>> | | |
| | | comma-separated list of hosts from which the <superuser> is allowed to impersonate other users. |
| | | <<<*>>> means wildcard. |
*-------------------------+-------------------------+------------------------+
| <<<hadoop.proxyuser.>>><superuser><<<.groups>>> | | |
| | | comma-separated list of groups to which the users impersonated by the <superuser> belong. |
| | | <<<*>>> means wildcard. |
*-------------------------+-------------------------+------------------------+
Configuration for <<<conf/core-site.xml>>>
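As a sketch, the first two rows of the table above translate into core-site.xml entries such as:
----
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
----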
** NameNode
*-------------------------+-------------------------+------------------------+
|| Parameter || Value || Notes |
*-------------------------+-------------------------+------------------------+
| <<<dfs.block.access.token.enable>>> | <true> | |
| | | Enable HDFS block access tokens for secure operations. |
*-------------------------+-------------------------+------------------------+
| <<<dfs.https.enable>>> | <true> | |
| | | This value is deprecated. Use dfs.http.policy |
*-------------------------+-------------------------+------------------------+
| <<<dfs.http.policy>>> | <HTTP_ONLY> or <HTTPS_ONLY> or <HTTP_AND_HTTPS> | |
| | | HTTPS_ONLY turns off http access. This option takes precedence over |
| | | the deprecated configuration dfs.https.enable and hadoop.ssl.enabled. |
| | | If using SASL to authenticate data transfer protocol instead of |
| | | running DataNode as root and using privileged ports, then this property |
| | | must be set to <HTTPS_ONLY> to guarantee authentication of HTTP servers. |
| | | (See <<<dfs.data.transfer.protection>>>.) |
*-------------------------+-------------------------+------------------------+
| <<<dfs.namenode.https-address>>> | <nn_host_fqdn:50470> | |
*-------------------------+-------------------------+------------------------+
| <<<dfs.https.port>>> | <50470> | |
*-------------------------+-------------------------+------------------------+
| <<<dfs.namenode.keytab.file>>> | </etc/security/keytab/nn.service.keytab> | |
| | | Kerberos keytab file for the NameNode. |
*-------------------------+-------------------------+------------------------+
| <<<dfs.namenode.kerberos.principal>>> | nn/_HOST@REALM.TLD | |
| | | Kerberos principal name for the NameNode. |
*-------------------------+-------------------------+------------------------+
| <<<dfs.namenode.kerberos.internal.spnego.principal>>> | HTTP/_HOST@REALM.TLD | |
| | | HTTP Kerberos principal name for the NameNode. |
*-------------------------+-------------------------+------------------------+
Configuration for <<<conf/hdfs-site.xml>>>
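For instance (the keytab path and realm are the placeholders used in the table), the NameNode rows above correspond to hdfs-site.xml entries like:
----
<property>
  <name>dfs.block.access.token.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.namenode.keytab.file</name>
  <value>/etc/security/keytab/nn.service.keytab</value>
</property>
<property>
  <name>dfs.namenode.kerberos.principal</name>
  <value>nn/_HOST@REALM.TLD</value>
</property>
<property>
  <name>dfs.namenode.kerberos.internal.spnego.principal</name>
  <value>HTTP/_HOST@REALM.TLD</value>
</property>
----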
** Secondary NameNode
*-------------------------+-------------------------+------------------------+
|| Parameter || Value || Notes |
*-------------------------+-------------------------+------------------------+
| <<<dfs.namenode.secondary.http-address>>> | <c_nn_host_fqdn:50090> | |
*-------------------------+-------------------------+------------------------+
| <<<dfs.namenode.secondary.https-port>>> | <50470> | |
*-------------------------+-------------------------+------------------------+
| <<<dfs.secondary.namenode.keytab.file>>> | | |
| | </etc/security/keytab/sn.service.keytab> | |
| | | Kerberos keytab file for the Secondary NameNode. |
*-------------------------+-------------------------+------------------------+
| <<<dfs.secondary.namenode.kerberos.principal>>> | sn/_HOST@REALM.TLD | |
| | | Kerberos principal name for the Secondary NameNode. |
*-------------------------+-------------------------+------------------------+
| <<<dfs.secondary.namenode.kerberos.internal.spnego.principal>>> | | |
| | HTTP/_HOST@REALM.TLD | |
| | | HTTP Kerberos principal name for the Secondary NameNode. |
*-------------------------+-------------------------+------------------------+
Configuration for <<<conf/hdfs-site.xml>>>
** DataNode
*-------------------------+-------------------------+------------------------+
|| Parameter || Value || Notes |
*-------------------------+-------------------------+------------------------+
| <<<dfs.datanode.data.dir.perm>>> | 700 | |
*-------------------------+-------------------------+------------------------+
| <<<dfs.datanode.address>>> | <0.0.0.0:1004> | |
| | | Secure DataNode must use privileged port |
| | | in order to assure that the server was started securely. |
| | | This means that the server must be started via jsvc. |
| | | Alternatively, this must be set to a non-privileged port if using SASL |
| | | to authenticate data transfer protocol. |
| | | (See <<<dfs.data.transfer.protection>>>.) |
*-------------------------+-------------------------+------------------------+
| <<<dfs.datanode.http.address>>> | <0.0.0.0:1006> | |
| | | Secure DataNode must use privileged port |
| | | in order to assure that the server was started securely. |
| | | This means that the server must be started via jsvc. |
*-------------------------+-------------------------+------------------------+
| <<<dfs.datanode.https.address>>> | <0.0.0.0:50470> | |
*-------------------------+-------------------------+------------------------+
| <<<dfs.datanode.keytab.file>>> | </etc/security/keytab/dn.service.keytab> | |
| | | Kerberos keytab file for the DataNode. |
*-------------------------+-------------------------+------------------------+
| <<<dfs.datanode.kerberos.principal>>> | dn/_HOST@REALM.TLD | |
| | | Kerberos principal name for the DataNode. |
*-------------------------+-------------------------+------------------------+
| <<<dfs.encrypt.data.transfer>>> | <false> | |
| | | set to <<<true>>> when using data encryption |
*-------------------------+-------------------------+------------------------+
| <<<dfs.encrypt.data.transfer.algorithm>>> | | |
| | | optionally set to <<<3des>>> or <<<rc4>>> when using data encryption to |
| | | control encryption algorithm |
*-------------------------+-------------------------+------------------------+
| <<<dfs.encrypt.data.transfer.cipher.suites>>> | | |
| | | optionally set to <<<AES/CTR/NoPadding>>> to activate AES encryption |
| | | when using data encryption |
*-------------------------+-------------------------+------------------------+
| <<<dfs.encrypt.data.transfer.cipher.key.bitlength>>> | | |
| | | optionally set to <<<128>>>, <<<192>>> or <<<256>>> to control key bit |
| | | length when using AES with data encryption |
*-------------------------+-------------------------+------------------------+
| <<<dfs.data.transfer.protection>>> | | |
| | | <authentication> : authentication only \
| | | <integrity> : integrity check in addition to authentication \
| | | <privacy> : data encryption in addition to integrity |
| | | This property is unspecified by default. Setting this property enables |
| | | SASL for authentication of data transfer protocol. If this is enabled, |
| | | then <<<dfs.datanode.address>>> must use a non-privileged port, |
| | | <<<dfs.http.policy>>> must be set to <HTTPS_ONLY> and the |
| | | <<<HADOOP_SECURE_DN_USER>>> environment variable must be undefined when |
| | | starting the DataNode process. |
*-------------------------+-------------------------+------------------------+
Configuration for <<<conf/hdfs-site.xml>>>
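A sketch of the jsvc-based (privileged port) variant from the table above; when SASL is used instead, <<<dfs.datanode.address>>> must point at a non-privileged port as noted:
----
<property>
  <name>dfs.datanode.data.dir.perm</name>
  <value>700</value>
</property>
<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:1004</value>
</property>
<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:1006</value>
</property>
<property>
  <name>dfs.datanode.keytab.file</name>
  <value>/etc/security/keytab/dn.service.keytab</value>
</property>
<property>
  <name>dfs.datanode.kerberos.principal</name>
  <value>dn/_HOST@REALM.TLD</value>
</property>
----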
** WebHDFS
*-------------------------+-------------------------+------------------------+
|| Parameter || Value || Notes |
*-------------------------+-------------------------+------------------------+
| <<<dfs.web.authentication.kerberos.principal>>> | http/_HOST@REALM.TLD | |
| | | Kerberos principal name for WebHDFS. |
*-------------------------+-------------------------+------------------------+
| <<<dfs.web.authentication.kerberos.keytab>>> | </etc/security/keytab/http.service.keytab> | |
| | | Kerberos keytab file for WebHDFS. |
*-------------------------+-------------------------+------------------------+
Configuration for <<<conf/hdfs-site.xml>>>
** ResourceManager
*-------------------------+-------------------------+------------------------+
|| Parameter || Value || Notes |
*-------------------------+-------------------------+------------------------+
| <<<yarn.resourcemanager.keytab>>> | | |
| | </etc/security/keytab/rm.service.keytab> | |
| | | Kerberos keytab file for the ResourceManager. |
*-------------------------+-------------------------+------------------------+
| <<<yarn.resourcemanager.principal>>> | rm/_HOST@REALM.TLD | |
| | | Kerberos principal name for the ResourceManager. |
*-------------------------+-------------------------+------------------------+
Configuration for <<<conf/yarn-site.xml>>>
** NodeManager
*-------------------------+-------------------------+------------------------+
|| Parameter || Value || Notes |
*-------------------------+-------------------------+------------------------+
| <<<yarn.nodemanager.keytab>>> | </etc/security/keytab/nm.service.keytab> | |
| | | Kerberos keytab file for the NodeManager. |
*-------------------------+-------------------------+------------------------+
| <<<yarn.nodemanager.principal>>> | nm/_HOST@REALM.TLD | |
| | | Kerberos principal name for the NodeManager. |
*-------------------------+-------------------------+------------------------+
| <<<yarn.nodemanager.container-executor.class>>> | | |
| | <<<org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor>>> |
| | | Use LinuxContainerExecutor. |
*-------------------------+-------------------------+------------------------+
| <<<yarn.nodemanager.linux-container-executor.group>>> | <hadoop> | |
| | | Unix group of the NodeManager. |
*-------------------------+-------------------------+------------------------+
| <<<yarn.nodemanager.linux-container-executor.path>>> | </path/to/bin/container-executor> | |
| | | The path to the executable of Linux container executor. |
*-------------------------+-------------------------+------------------------+
Configuration for <<<conf/yarn-site.xml>>>
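As a sketch corresponding to the table above (the realm and the keytab path are the placeholders used there):
----
<property>
  <name>yarn.nodemanager.keytab</name>
  <value>/etc/security/keytab/nm.service.keytab</value>
</property>
<property>
  <name>yarn.nodemanager.principal</name>
  <value>nm/_HOST@REALM.TLD</value>
</property>
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value>
</property>
----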
** Configuration for WebAppProxy
The <<<WebAppProxy>>> provides a proxy between the web applications
exported by an application and an end user. If security is enabled
it will warn users before accessing a potentially unsafe web application.
Authentication and authorization using the proxy is handled just like
any other privileged web application.
*-------------------------+-------------------------+------------------------+
|| Parameter || Value || Notes |
*-------------------------+-------------------------+------------------------+
| <<<yarn.web-proxy.address>>> | | |
| | <<<WebAppProxy>>> host:port for proxy to AM web apps. | |
| | | <host:port> If this is the same as <<<yarn.resourcemanager.webapp.address>>>|
| | | or it is not defined, then the <<<ResourceManager>>> will run the proxy;|
| | | otherwise a standalone proxy server will need to be launched.|
*-------------------------+-------------------------+------------------------+
| <<<yarn.web-proxy.keytab>>> | | |
| | </etc/security/keytab/web-app.service.keytab> | |
| | | Kerberos keytab file for the WebAppProxy. |
*-------------------------+-------------------------+------------------------+
| <<<yarn.web-proxy.principal>>> | wap/_HOST@REALM.TLD | |
| | | Kerberos principal name for the WebAppProxy. |
*-------------------------+-------------------------+------------------------+
Configuration for <<<conf/yarn-site.xml>>>
** LinuxContainerExecutor
A <<<ContainerExecutor>>> is used by the YARN framework to define how a
<container> is launched and controlled.
The following container executors are available in Hadoop YARN:
*--------------------------------------+--------------------------------------+
|| ContainerExecutor || Description |
*--------------------------------------+--------------------------------------+
| <<<DefaultContainerExecutor>>> | |
| | The default executor which YARN uses to manage container execution. |
| | The container process has the same Unix user as the NodeManager. |
*--------------------------------------+--------------------------------------+
| <<<LinuxContainerExecutor>>> | |
| | Supported only on GNU/Linux, this executor runs the containers as either the |
| | YARN user who submitted the application (when full security is enabled) or |
| | as a dedicated user (defaults to nobody) when full security is not enabled. |
| | When full security is enabled, this executor requires all user accounts to be |
| | created on the cluster nodes where the containers are launched. It uses |
| | a <setuid> executable that is included in the Hadoop distribution. |
| | The NodeManager uses this executable to launch and kill containers. |
| | The setuid executable switches to the user who has submitted the |
| | application and launches or kills the containers. For maximum security, |
| | this executor sets up restricted permissions and user/group ownership of |
| | local files and directories used by the containers such as the shared |
| | objects, jars, intermediate files, log files etc. Particularly note that, |
| | because of this, except the application owner and NodeManager, no other |
| | user can access any of the local files/directories including those |
| | localized as part of the distributed cache. |
*--------------------------------------+--------------------------------------+
To build the LinuxContainerExecutor executable run:
----
$ mvn package -Dcontainer-executor.conf.dir=/etc/hadoop/
----
The path passed in <<<-Dcontainer-executor.conf.dir>>> should be the
path on the cluster nodes where a configuration file for the setuid
executable should be located. The executable should be installed in
$HADOOP_YARN_HOME/bin.
The executable must have specific permissions: 6050 or --Sr-s---
permissions user-owned by <root> (super-user) and group-owned by a
special group (e.g. <<<hadoop>>>) of which the NodeManager Unix user is
the group member and no ordinary application user is. If any application
user belongs to this special group, security will be compromised. This
special group name should be specified for the configuration property
<<<yarn.nodemanager.linux-container-executor.group>>> in both
<<<conf/yarn-site.xml>>> and <<<conf/container-executor.cfg>>>.
For example, let's say that the NodeManager is run as user <yarn> who is
part of the groups <users> and <hadoop>, either of them being the primary group.
Let us also say that <users> has both <yarn> and another user
(the application submitter) <alice> as its members, and <alice> does not
belong to <hadoop>. Going by the above description, the setuid/setgid
executable should be set 6050 or --Sr-s--- with user-owner as <yarn> and
group-owner as <hadoop> which has <yarn> as its member (and not <users>
which has <alice> also as its member besides <yarn>).
The LinuxContainerExecutor requires that the paths including and leading up to
the directories specified in <<<yarn.nodemanager.local-dirs>>> and
<<<yarn.nodemanager.log-dirs>>> be set to 755 permissions, as described
above in the table on directory permissions.
* <<<conf/container-executor.cfg>>>
The executable requires a configuration file called
<<<container-executor.cfg>>> to be present in the configuration
directory passed to the mvn target mentioned above.
The configuration file must be owned by the user running the NodeManager
(user <<<yarn>>> in the above example), group-owned by any group and
should have the permissions 0400 or r--------.
The executable requires the following configuration items to be present
in the <<<conf/container-executor.cfg>>> file. The items should be
specified as simple key=value pairs, one per line:
*-------------------------+-------------------------+------------------------+
|| Parameter || Value || Notes |
*-------------------------+-------------------------+------------------------+
| <<<yarn.nodemanager.linux-container-executor.group>>> | <hadoop> | |
| | | Unix group of the NodeManager. The group owner of the |
| | |<container-executor> binary should be this group. Should be same as the |
| | | value with which the NodeManager is configured. This configuration is |
| | | required for validating the secure access of the <container-executor> |
| | | binary. |
*-------------------------+-------------------------+------------------------+
| <<<banned.users>>> | hdfs,yarn,mapred,bin | Banned users. |
*-------------------------+-------------------------+------------------------+
| <<<allowed.system.users>>> | foo,bar | Allowed system users. |
*-------------------------+-------------------------+------------------------+
| <<<min.user.id>>> | 1000 | Prevent other super-users. |
*-------------------------+-------------------------+------------------------+
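For illustration, a <<<conf/container-executor.cfg>>> populated with the values from the table above might look like the following sketch (the banned and allowed users shown are examples, not recommendations):
----
yarn.nodemanager.linux-container-executor.group=hadoop
banned.users=hdfs,yarn,mapred,bin
allowed.system.users=foo,bar
min.user.id=1000
----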
Configuration for <<<conf/yarn-site.xml>>>
To re-cap, here are the local file-system permissions required for the
various paths related to the <<<LinuxContainerExecutor>>>:
*-------------------+-------------------+------------------+------------------+
|| Filesystem || Path || User:Group || Permissions |
*-------------------+-------------------+------------------+------------------+
| local | container-executor | root:hadoop | --Sr-s--- |
*-------------------+-------------------+------------------+------------------+
| local | <<<conf/container-executor.cfg>>> | root:hadoop | r-------- |
*-------------------+-------------------+------------------+------------------+
| local | <<<yarn.nodemanager.local-dirs>>> | yarn:hadoop | drwxr-xr-x |
*-------------------+-------------------+------------------+------------------+
| local | <<<yarn.nodemanager.log-dirs>>> | yarn:hadoop | drwxr-xr-x |
*-------------------+-------------------+------------------+------------------+
** MapReduce JobHistory Server
*-------------------------+-------------------------+------------------------+
|| Parameter || Value || Notes |
*-------------------------+-------------------------+------------------------+
| <<<mapreduce.jobhistory.address>>> | | |
| | MapReduce JobHistory Server <host:port> | Default port is 10020. |
*-------------------------+-------------------------+------------------------+
| <<<mapreduce.jobhistory.keytab>>> | |
| | </etc/security/keytab/jhs.service.keytab> | |
| | | Kerberos keytab file for the MapReduce JobHistory Server. |
*-------------------------+-------------------------+------------------------+
| <<<mapreduce.jobhistory.principal>>> | jhs/_HOST@REALM.TLD | |
| | | Kerberos principal name for the MapReduce JobHistory Server. |
*-------------------------+-------------------------+------------------------+
Configuration for <<<conf/mapred-site.xml>>>
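For example, a minimal <<<conf/mapred-site.xml>>> fragment for a secured JobHistory Server might look like the sketch below; the host name is a placeholder and the keytab path follows the convention from the table above:
----
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>jhs.example.com:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.keytab</name>
  <value>/etc/security/keytab/jhs.service.keytab</value>
</property>
<property>
  <name>mapreduce.jobhistory.principal</name>
  <value>jhs/_HOST@REALM.TLD</value>
</property>
----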


@ -1,216 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Service Level Authorization Guide
---
---
${maven.build.timestamp}
Service Level Authorization Guide
%{toc|section=1|fromDepth=0}
* Purpose
This document describes how to configure and manage Service Level
Authorization for Hadoop.
* Prerequisites
Make sure Hadoop is installed, configured and setup correctly. For more
information see:
* {{{./SingleCluster.html}Single Node Setup}} for first-time users.
* {{{./ClusterSetup.html}Cluster Setup}} for large, distributed clusters.
* Overview
Service Level Authorization is the initial authorization mechanism to
ensure clients connecting to a particular Hadoop service have the
necessary, pre-configured, permissions and are authorized to access the
given service. For example, a MapReduce cluster can use this mechanism
to allow a configured list of users/groups to submit jobs.
The <<<${HADOOP_CONF_DIR}/hadoop-policy.xml>>> configuration file is used to
define the access control lists for various Hadoop services.
Service Level Authorization is performed before other access
control checks such as file-permission checks, access control on job
queues, etc.
* Configuration
This section describes how to configure service-level authorization via
the configuration file <<<${HADOOP_CONF_DIR}/hadoop-policy.xml>>>.
** Enable Service Level Authorization
By default, service-level authorization is disabled for Hadoop. To
enable it, set the configuration property <<<hadoop.security.authorization>>>
to <<<true>>> in <<<${HADOOP_CONF_DIR}/core-site.xml>>>.
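For example, the following <<<core-site.xml>>> fragment turns service-level authorization on:
----
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
----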
** Hadoop Services and Configuration Properties
This section lists the various Hadoop services and their configuration
knobs:
*-------------------------------------+--------------------------------------+
|| Property || Service
*-------------------------------------+--------------------------------------+
security.client.protocol.acl | ACL for ClientProtocol, which is used by user code via the DistributedFileSystem.
*-------------------------------------+--------------------------------------+
security.client.datanode.protocol.acl | ACL for ClientDatanodeProtocol, the client-to-datanode protocol for block recovery.
*-------------------------------------+--------------------------------------+
security.datanode.protocol.acl | ACL for DatanodeProtocol, which is used by datanodes to communicate with the namenode.
*-------------------------------------+--------------------------------------+
security.inter.datanode.protocol.acl | ACL for InterDatanodeProtocol, the inter-datanode protocol for updating generation timestamp.
*-------------------------------------+--------------------------------------+
security.namenode.protocol.acl | ACL for NamenodeProtocol, the protocol used by the secondary namenode to communicate with the namenode.
*-------------------------------------+--------------------------------------+
security.inter.tracker.protocol.acl | ACL for InterTrackerProtocol, used by the tasktrackers to communicate with the jobtracker.
*-------------------------------------+--------------------------------------+
security.job.submission.protocol.acl | ACL for JobSubmissionProtocol, used by job clients to communicate with the jobtracker for job submission, querying job status etc.
*-------------------------------------+--------------------------------------+
security.task.umbilical.protocol.acl | ACL for TaskUmbilicalProtocol, used by the map and reduce tasks to communicate with the parent tasktracker.
*-------------------------------------+--------------------------------------+
security.refresh.policy.protocol.acl | ACL for RefreshAuthorizationPolicyProtocol, used by the dfsadmin and mradmin commands to refresh the security policy in-effect.
*-------------------------------------+--------------------------------------+
security.ha.service.protocol.acl | ACL for HAService protocol used by HAAdmin to manage the active and stand-by states of namenode.
*-------------------------------------+--------------------------------------+
** Access Control Lists
<<<${HADOOP_CONF_DIR}/hadoop-policy.xml>>> defines an access control list for
each Hadoop service. Every access control list has a simple format:
The list of users and the list of groups are both comma-separated lists of names.
The two lists are separated by a space.
Example: <<<user1,user2 group1,group2>>>.
Add a blank at the beginning of the line if only a list of groups is to
be provided; equivalently, a comma-separated list of users followed by
a space or nothing implies only a set of the given users.
A special value of <<<*>>> implies that all users are allowed to access the
service.
If the access control list is not defined for a service, the value of
<<<security.service.authorization.default.acl>>> is applied. If
<<<security.service.authorization.default.acl>>> is not defined, <<<*>>> is applied.
** Blocked Access Control Lists
In some cases, it is required to specify a blocked access control list for a service. This specifies
the list of users and groups who are not authorized to access the service. The format of
the blocked access control list is the same as that of the access control list. The blocked access
control list can be specified via <<<${HADOOP_CONF_DIR}/hadoop-policy.xml>>>. The property name
is derived by suffixing with ".blocked".
Example: The property name of the blocked access control list for <<<security.client.protocol.acl>>>
will be <<<security.client.protocol.acl.blocked>>>.
For a service, it is possible to specify both an access control list and a blocked access control
list. A user is authorized to access the service if the user is in the access control list and not in
the blocked access control list.
If the blocked access control list is not defined for a service, the value of
<<<security.service.authorization.default.acl.blocked>>> is applied. If
<<<security.service.authorization.default.acl.blocked>>> is not defined,
an empty blocked access control list is applied.
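For illustration, blocking a hypothetical user <<<baduser>>> and a hypothetical group <<<blockedgroup>>> from the client protocol could be configured as follows in <<<${HADOOP_CONF_DIR}/hadoop-policy.xml>>>:
----
<property>
  <name>security.client.protocol.acl.blocked</name>
  <value>baduser blockedgroup</value>
</property>
----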
** Refreshing Service Level Authorization Configuration
The service-level authorization configuration for the NameNode and
JobTracker can be changed without restarting either of the Hadoop
master daemons. The cluster administrator can change
<<<${HADOOP_CONF_DIR}/hadoop-policy.xml>>> on the master nodes and instruct
the NameNode and JobTracker to reload their respective configurations
via the <<<-refreshServiceAcl>>> switch to <<<dfsadmin>>> and <<<mradmin>>> commands
respectively.
Refresh the service-level authorization configuration for the NameNode:
----
$ bin/hadoop dfsadmin -refreshServiceAcl
----
Refresh the service-level authorization configuration for the
JobTracker:
----
$ bin/hadoop mradmin -refreshServiceAcl
----
Of course, one can use the <<<security.refresh.policy.protocol.acl>>>
property in <<<${HADOOP_CONF_DIR}/hadoop-policy.xml>>> to restrict access to
the ability to refresh the service-level authorization configuration to
certain users/groups.
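For instance, restricting refreshes to members of a hypothetical <<<hadoopadmin>>> group could look like the following sketch (note the leading blank in the value, indicating a groups-only list):
----
<property>
  <name>security.refresh.policy.protocol.acl</name>
  <value> hadoopadmin</value>
</property>
----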
** Access Control using list of ip addresses, host names and ip ranges
Access to a service can be controlled based on the ip address of the client accessing
the service. It is possible to restrict access to a service from a set of machines by
specifying a list of ip addresses, host names, and ip ranges. The property name for each service
is derived from the corresponding acl's property name. If the property name of the acl is
<<<security.client.protocol.acl>>>, the property name for the hosts list will be
<<<security.client.protocol.hosts>>>.
If the hosts list is not defined for a service, the value of
<<<security.service.authorization.default.hosts>>> is applied. If
<<<security.service.authorization.default.hosts>>> is not defined, <<<*>>> is applied.
It is possible to specify a blocked list of hosts. Only those machines which are in the
hosts list, but not in the blocked hosts list, will be granted access to the service. The property
name is derived by suffixing with ".blocked".
Example: The property name of the blocked hosts list for <<<security.client.protocol.hosts>>>
will be <<<security.client.protocol.hosts.blocked>>>.
If the blocked hosts list is not defined for a service, the value of
<<<security.service.authorization.default.hosts.blocked>>> is applied. If
<<<security.service.authorization.default.hosts.blocked>>> is not defined,
an empty blocked hosts list is applied.
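For illustration, limiting DFSClient access to a hypothetical subnet plus one additional host could be configured as:
----
<property>
  <name>security.client.protocol.hosts</name>
  <value>192.168.1.0/24,10.0.0.1</value>
</property>
----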
** Examples
Allow only users <<<alice>>>, <<<bob>>> and users in the <<<mapreduce>>> group to submit
jobs to the MapReduce cluster:
----
<property>
<name>security.job.submission.protocol.acl</name>
<value>alice,bob mapreduce</value>
</property>
----
Allow only DataNodes running as the users who belong to the group
datanodes to communicate with the NameNode:
----
<property>
<name>security.datanode.protocol.acl</name>
<value>datanodes</value>
</property>
----
Allow any user to talk to the HDFS cluster as a DFSClient:
----
<property>
<name>security.client.protocol.acl</name>
<value>*</value>
</property>
----


@ -1,286 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop MapReduce Next Generation ${project.version} - Setting up a Single Node Cluster.
---
---
${maven.build.timestamp}
Hadoop MapReduce Next Generation - Setting up a Single Node Cluster.
%{toc|section=1|fromDepth=0}
* Purpose
This document describes how to set up and configure a single-node Hadoop
installation so that you can quickly perform simple operations using Hadoop
MapReduce and the Hadoop Distributed File System (HDFS).
* Prerequisites
** Supported Platforms
* GNU/Linux is supported as a development and production platform.
Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
* Windows is also a supported platform but the following steps
are for Linux only. To set up Hadoop on Windows, see
{{{http://wiki.apache.org/hadoop/Hadoop2OnWindows}wiki page}}.
** Required Software
Required software for Linux includes:
[[1]] Java™ must be installed. Recommended Java versions are described
at {{{http://wiki.apache.org/hadoop/HadoopJavaVersions}
HadoopJavaVersions}}.
[[2]] ssh must be installed and sshd must be running to use the Hadoop
scripts that manage remote Hadoop daemons.
** Installing Software
If your cluster doesn't have the requisite software you will need to install
it.
For example on Ubuntu Linux:
----
$ sudo apt-get install ssh
$ sudo apt-get install rsync
----
* Download
To get a Hadoop distribution, download a recent stable release from one of
the {{{http://www.apache.org/dyn/closer.cgi/hadoop/common/}
Apache Download Mirrors}}.
* Prepare to Start the Hadoop Cluster
Unpack the downloaded Hadoop distribution. In the distribution, edit
the file <<<etc/hadoop/hadoop-env.sh>>> to define some parameters as
follows:
----
# set to the root of your Java installation
export JAVA_HOME=/usr/java/latest
# Assuming your installation directory is /usr/local/hadoop
export HADOOP_PREFIX=/usr/local/hadoop
----
Try the following command:
----
$ bin/hadoop
----
This will display the usage documentation for the hadoop script.
Now you are ready to start your Hadoop cluster in one of the three supported
modes:
* {{{Standalone Operation}Local (Standalone) Mode}}
* {{{Pseudo-Distributed Operation}Pseudo-Distributed Mode}}
* {{{Fully-Distributed Operation}Fully-Distributed Mode}}
* Standalone Operation
By default, Hadoop is configured to run in a non-distributed mode, as a
single Java process. This is useful for debugging.
The following example copies the unpacked conf directory to use as input
and then finds and displays every match of the given regular expression.
Output is written to the given output directory.
----
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-${project.version}.jar grep input output 'dfs[a-z.]+'
$ cat output/*
----
* Pseudo-Distributed Operation
Hadoop can also be run on a single-node in a pseudo-distributed mode where
each Hadoop daemon runs in a separate Java process.
** Configuration
Use the following:
etc/hadoop/core-site.xml:
+---+
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
+---+
etc/hadoop/hdfs-site.xml:
+---+
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
+---+
** Setup passphraseless ssh
Now check that you can ssh to the localhost without a passphrase:
----
$ ssh localhost
----
If you cannot ssh to localhost without a passphrase, execute the
following commands:
----
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
----
** Execution
The following instructions are to run a MapReduce job locally.
If you want to execute a job on YARN, see {{YARN on Single Node}}.
[[1]] Format the filesystem:
----
$ bin/hdfs namenode -format
----
[[2]] Start NameNode daemon and DataNode daemon:
----
$ sbin/start-dfs.sh
----
The hadoop daemon log output is written to the <<<${HADOOP_LOG_DIR}>>>
directory (defaults to <<<${HADOOP_HOME}/logs>>>).
[[3]] Browse the web interface for the NameNode; by default it is
available at:
* NameNode - <<<http://localhost:50070/>>>
[[4]] Make the HDFS directories required to execute MapReduce jobs:
----
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/<username>
----
[[5]] Copy the input files into the distributed filesystem:
----
$ bin/hdfs dfs -put etc/hadoop input
----
[[6]] Run some of the examples provided:
----
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-${project.version}.jar grep input output 'dfs[a-z.]+'
----
[[7]] Examine the output files:
Copy the output files from the distributed filesystem to the local
filesystem and examine them:
----
$ bin/hdfs dfs -get output output
$ cat output/*
----
or
View the output files on the distributed filesystem:
----
$ bin/hdfs dfs -cat output/*
----
[[8]] When you're done, stop the daemons with:
----
$ sbin/stop-dfs.sh
----
** YARN on Single Node
You can run a MapReduce job on YARN in a pseudo-distributed mode by setting
a few parameters and additionally running the ResourceManager daemon and
NodeManager daemon.
The following instructions assume that steps 1 through 4 of
{{{Execution}the above instructions}} have already been executed.
[[1]] Configure parameters as follows:
etc/hadoop/mapred-site.xml:
+---+
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
+---+
etc/hadoop/yarn-site.xml:
+---+
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
+---+
[[2]] Start ResourceManager daemon and NodeManager daemon:
----
$ sbin/start-yarn.sh
----
[[3]] Browse the web interface for the ResourceManager; by default it is
available at:
* ResourceManager - <<<http://localhost:8088/>>>
[[4]] Run a MapReduce job.
[[5]] When you're done, stop the daemons with:
----
$ sbin/stop-yarn.sh
----
* Fully-Distributed Operation
For information on setting up fully-distributed, non-trivial clusters
see {{{./ClusterSetup.html}Cluster Setup}}.


@ -1,24 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Single Node Setup
---
---
${maven.build.timestamp}
Single Node Setup
This page will be removed in the next major release.
See {{{./SingleCluster.html}Single Cluster Setup}} to set up and configure a
single-node Hadoop installation.


@ -1,144 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Proxy user - Superusers Acting On Behalf Of Other Users
---
---
${maven.build.timestamp}
Proxy user - Superusers Acting On Behalf Of Other Users
%{toc|section=1|fromDepth=0}
* Introduction
This document describes how a superuser can submit jobs or access hdfs
on behalf of another user.
* Use Case
The code example described in the next section is applicable for the
following use case.
A superuser with username 'super' wants to submit a job and access hdfs
on behalf of a user joe. The superuser has kerberos credentials but
user joe doesn't have any. The tasks are required to run as user joe
and any file accesses on the namenode are required to be done as user joe.
It is required that user joe can connect to the namenode or job tracker
on a connection authenticated with super's kerberos credentials. In
other words super is impersonating the user joe.
Some products such as Apache Oozie need this.
* Code example
In this example super's credentials are used for login and a
proxy user ugi object is created for joe. The operations are performed
within the doAs method of this proxy user ugi object.
----
...
//Create ugi for joe. The login user is 'super'.
UserGroupInformation ugi =
UserGroupInformation.createProxyUser("joe", UserGroupInformation.getLoginUser());
ugi.doAs(new PrivilegedExceptionAction<Void>() {
  public Void run() throws Exception {
    //Submit a job
    JobClient jc = new JobClient(conf);
    jc.submitJob(conf);
    //OR access hdfs
    FileSystem fs = FileSystem.get(conf);
    fs.mkdirs(someFilePath);
    return null;
  }
});
----
* Configurations
You can configure proxy user using properties
<<<hadoop.proxyuser.${superuser}.hosts>>> along with either or both of
<<<hadoop.proxyuser.${superuser}.groups>>>
and <<<hadoop.proxyuser.${superuser}.users>>>.
By specifying as below in core-site.xml,
the superuser named <<<super>>> can connect
only from <<<host1>>> and <<<host2>>>
to impersonate a user belonging to <<<group1>>> and <<<group2>>>.
----
<property>
<name>hadoop.proxyuser.super.hosts</name>
<value>host1,host2</value>
</property>
<property>
<name>hadoop.proxyuser.super.groups</name>
<value>group1,group2</value>
</property>
----
If these configurations are not present, impersonation will not be
allowed and connection will fail.
If more lax security is preferred, the wildcard value * may be used to
allow impersonation from any host or of any user.
For example, by specifying as below in core-site.xml,
user named <<<oozie>>> accessing from any host
can impersonate any user belonging to any group.
----
<property>
<name>hadoop.proxyuser.oozie.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.oozie.groups</name>
<value>*</value>
</property>
----
The <<<hadoop.proxyuser.${superuser}.hosts>>> property accepts a list of ip addresses,
ip address ranges in CIDR format, and/or host names.
For example, by specifying as below,
user named <<<super>>> accessing from hosts in the range
<<<10.222.0.0-15>>> and <<<10.113.221.221>>> can impersonate
<<<user1>>> and <<<user2>>>.
----
<property>
<name>hadoop.proxyuser.super.hosts</name>
<value>10.222.0.0/16,10.113.221.221</value>
</property>
<property>
<name>hadoop.proxyuser.super.users</name>
<value>user1,user2</value>
</property>
----
* Caveats
If the cluster is running in {{{./SecureMode.html}Secure Mode}},
the superuser must have kerberos credentials to be able to impersonate
another user.
It cannot use delegation tokens for this feature. It
would be wrong if the superuser added its own delegation token to the proxy
user ugi, as that would allow the proxy user to connect to the service
with the privileges of the superuser.
However, if the superuser does want to give a delegation token to joe,
it must first impersonate joe and get a delegation token for joe, in
the same way as in the code example above, and add it to the ugi of joe.
In this way the delegation token will have joe as its owner.


@ -1,233 +0,0 @@
~~ Licensed under the Apache License, Version 2.0 (the "License");
~~ you may not use this file except in compliance with the License.
~~ You may obtain a copy of the License at
~~
~~ http://www.apache.org/licenses/LICENSE-2.0
~~
~~ Unless required by applicable law or agreed to in writing, software
~~ distributed under the License is distributed on an "AS IS" BASIS,
~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~~ See the License for the specific language governing permissions and
~~ limitations under the License. See accompanying LICENSE file.
---
Hadoop Distributed File System-${project.version} - Enabling Dapper-like Tracing
---
---
${maven.build.timestamp}
Enabling Dapper-like Tracing in Hadoop
%{toc|section=1|fromDepth=0}
* {Dapper-like Tracing in Hadoop}
** HTrace
{{{https://issues.apache.org/jira/browse/HDFS-5274}HDFS-5274}}
added support for tracing requests through HDFS,
using the open source tracing library, {{{https://git-wip-us.apache.org/repos/asf/incubator-htrace.git}Apache HTrace}}.
Setting up tracing is quite simple; however, it requires some very minor changes to your client code.
** Samplers
Configure the samplers via the <<<core-site.xml>>> property <<<hadoop.htrace.sampler>>>.
The value can be NeverSampler, AlwaysSampler or ProbabilitySampler. NeverSampler: HTrace is OFF
for all spans; AlwaysSampler: HTrace is ON for all spans; ProbabilitySampler: HTrace is ON for
a configurable fraction of top-level spans.
+----
<property>
<name>hadoop.htrace.sampler</name>
<value>NeverSampler</value>
</property>
+----
** SpanReceivers
The tracing system works by collecting information in structs called 'Spans'.
It is up to you to choose how you want to receive this information
by implementing the SpanReceiver interface, which defines one method:
+----
public void receiveSpan(Span span);
+----
Configure which SpanReceivers you'd like to use
by putting a comma-separated list of the fully-qualified class names of
classes implementing SpanReceiver
in the <<<core-site.xml>>> property <<<hadoop.htrace.spanreceiver.classes>>>.
+----
<property>
<name>hadoop.htrace.spanreceiver.classes</name>
<value>org.apache.htrace.impl.LocalFileSpanReceiver</value>
</property>
<property>
<name>hadoop.htrace.local-file-span-receiver.path</name>
<value>/var/log/hadoop/htrace.out</value>
</property>
+----
You can omit the package name prefix if you use a span receiver bundled with HTrace.
+----
<property>
<name>hadoop.htrace.spanreceiver.classes</name>
<value>LocalFileSpanReceiver</value>
</property>
+----
** Setting up ZipkinSpanReceiver
Instead of implementing SpanReceiver by yourself,
you can use <<<ZipkinSpanReceiver>>> which uses
{{{https://github.com/twitter/zipkin}Zipkin}}
for collecting and displaying tracing data.
In order to use <<<ZipkinSpanReceiver>>>,
you need to download and set up {{{https://github.com/twitter/zipkin}Zipkin}} first.
You also need to add the jar of <<<htrace-zipkin>>> to the classpath of Hadoop on each node.
Here is an example setup procedure:
+----
$ git clone https://github.com/cloudera/htrace
$ cd htrace/htrace-zipkin
$ mvn compile assembly:single
$ cp target/htrace-zipkin-*-jar-with-dependencies.jar $HADOOP_HOME/share/hadoop/common/lib/
+----
The sample configuration for <<<ZipkinSpanReceiver>>> is shown below.
By adding these to the <<<core-site.xml>>> of the NameNode and DataNodes,
<<<ZipkinSpanReceiver>>> is initialized at startup.
You also need this configuration on the client node in addition to the servers.
+----
<property>
<name>hadoop.htrace.spanreceiver.classes</name>
<value>ZipkinSpanReceiver</value>
</property>
<property>
<name>hadoop.htrace.zipkin.collector-hostname</name>
<value>192.168.1.2</value>
</property>
<property>
<name>hadoop.htrace.zipkin.collector-port</name>
<value>9410</value>
</property>
+----
** Dynamic update of tracing configuration
You can use the <<<hadoop trace>>> command to see and update the tracing configuration of each server.
You must specify the IPC server address of the namenode or datanode with the <<<-host>>> option.
You need to run the command against all servers if you want to update the configuration of all servers.
<<<hadoop trace -list>>> shows the list of loaded span receivers associated with their ids.
+----
$ hadoop trace -list -host 192.168.56.2:9000
ID CLASS
1 org.apache.htrace.impl.LocalFileSpanReceiver
$ hadoop trace -list -host 192.168.56.2:50020
ID CLASS
1 org.apache.htrace.impl.LocalFileSpanReceiver
+----
<<<hadoop trace -remove>>> removes a span receiver from a server.
The <<<-remove>>> option takes the id of the span receiver as its argument.
+----
$ hadoop trace -remove 1 -host 192.168.56.2:9000
Removed trace span receiver 1
+----
<<<hadoop trace -add>>> adds a span receiver to a server.
You need to specify the class name of the span receiver as the argument of the <<<-class>>> option.
You can specify configuration associated with the span receiver with <<<-Ckey=value>>> options.
+----
$ hadoop trace -add -class LocalFileSpanReceiver -Chadoop.htrace.local-file-span-receiver.path=/tmp/htrace.out -host 192.168.56.2:9000
Added trace span receiver 2 with configuration hadoop.htrace.local-file-span-receiver.path = /tmp/htrace.out
$ hadoop trace -list -host 192.168.56.2:9000
ID CLASS
2 org.apache.htrace.impl.LocalFileSpanReceiver
+----
** Starting tracing spans by HTrace API
In order to trace,
you will need to wrap the traced logic with a <<tracing span>> as shown below.
When there are running tracing spans,
the tracing information is propagated to servers along with RPC requests.
In addition, you need to initialize <<<SpanReceiver>>> once per process.
+----
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.tracing.SpanReceiverHost;
import org.apache.htrace.Sampler;
import org.apache.htrace.Trace;
import org.apache.htrace.TraceScope;
...
SpanReceiverHost.getInstance(new HdfsConfiguration());
...
TraceScope ts = Trace.startSpan("Gets", Sampler.ALWAYS);
try {
... // traced logic
} finally {
if (ts != null) ts.close();
}
+----
** Sample code for tracing
The <<<TracingFsShell.java>>> shown below is a wrapper of FsShell
which starts a tracing span before invoking the HDFS shell command.
+----
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.tracing.SpanReceiverHost;
import org.apache.hadoop.util.ToolRunner;
import org.apache.htrace.Sampler;
import org.apache.htrace.Trace;
import org.apache.htrace.TraceScope;
public class TracingFsShell {
public static void main(String argv[]) throws Exception {
Configuration conf = new Configuration();
FsShell shell = new FsShell();
conf.setQuietMode(false);
shell.setConf(conf);
SpanReceiverHost.getInstance(conf);
int res = 0;
TraceScope ts = null;
try {
ts = Trace.startSpan("FsShell", Sampler.ALWAYS);
res = ToolRunner.run(shell, argv);
} finally {
shell.close();
if (ts != null) ts.close();
}
System.exit(res);
}
}
+----
You can compile and execute this code as shown below.
+----
$ javac -cp `hadoop classpath` TracingFsShell.java
$ java -cp .:`hadoop classpath` TracingFsShell -ls /
+----


@ -0,0 +1,68 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Hadoop: CLI MiniCluster.
========================
* [Hadoop: CLI MiniCluster.](#Hadoop:_CLI_MiniCluster.)
* [Purpose](#Purpose)
* [Hadoop Tarball](#Hadoop_Tarball)
* [Running the MiniCluster](#Running_the_MiniCluster)
Purpose
-------
Using the CLI MiniCluster, users can simply start and stop a single-node Hadoop cluster with a single command, and without the need to set any environment variables or manage configuration files. The CLI MiniCluster starts both a `YARN`/`MapReduce` and an `HDFS` cluster.
This is useful for cases where users want to quickly experiment with a real Hadoop cluster or test non-Java programs that rely on significant Hadoop functionality.
Hadoop Tarball
--------------
You should be able to obtain the Hadoop tarball from the release. Also, you can directly create a tarball from the source:
$ mvn clean install -DskipTests
$ mvn package -Pdist -Dtar -DskipTests -Dmaven.javadoc.skip
**NOTE:** You will need [protoc 2.5.0](http://code.google.com/p/protobuf/) installed.
The tarball should be available in `hadoop-dist/target/` directory.
Running the MiniCluster
-----------------------
From inside the root directory of the extracted tarball, you can start the CLI MiniCluster using the following command:
$ bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-${project.version}-tests.jar minicluster -rmport RM_PORT -jhsport JHS_PORT
In the example command above, `RM_PORT` and `JHS_PORT` should be replaced by the user's choice of these port numbers. If not specified, random free ports will be used.
There are a number of command line arguments that the users can use to control which services to start, and to pass other configuration properties. The available command line arguments:
$ -D <property=value> Options to pass into configuration object
$ -datanodes <arg> How many datanodes to start (default 1)
$ -format Format the DFS (default false)
$ -help Prints option help.
$ -jhsport <arg> JobHistoryServer port (default 0--we choose)
$ -namenode <arg> URL of the namenode (default is either the DFS
$ cluster or a temporary dir)
$ -nnport <arg> NameNode port (default 0--we choose)
$ -nodemanagers <arg> How many nodemanagers to start (default 1)
$ -nodfs Don't start a mini DFS cluster
$ -nomr Don't start a mini MR cluster
$ -rmport <arg> ResourceManager port (default 0--we choose)
$ -writeConfig <path> Save configuration to this XML file.
$ -writeDetails <path> Write basic information to this JSON file.
To display this full list of available arguments, the user can pass the `-help` argument to the above command.


@ -0,0 +1,313 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Apache Hadoop Compatibility
===========================
* [Apache Hadoop Compatibility](#Apache_Hadoop_Compatibility)
* [Purpose](#Purpose)
* [Compatibility types](#Compatibility_types)
* [Java API](#Java_API)
* [Use Cases](#Use_Cases)
* [Policy](#Policy)
* [Semantic compatibility](#Semantic_compatibility)
* [Policy](#Policy)
* [Wire compatibility](#Wire_compatibility)
* [Use Cases](#Use_Cases)
* [Policy](#Policy)
* [Java Binary compatibility for end-user applications i.e. Apache Hadoop ABI](#Java_Binary_compatibility_for_end-user_applications_i.e._Apache_Hadoop_ABI)
* [Use cases](#Use_cases)
* [Policy](#Policy)
* [REST APIs](#REST_APIs)
* [Policy](#Policy)
* [Metrics/JMX](#MetricsJMX)
* [Policy](#Policy)
* [File formats & Metadata](#File_formats__Metadata)
* [User-level file formats](#User-level_file_formats)
* [Policy](#Policy)
* [System-internal file formats](#System-internal_file_formats)
* [MapReduce](#MapReduce)
* [Policy](#Policy)
* [HDFS Metadata](#HDFS_Metadata)
* [Policy](#Policy)
* [Command Line Interface (CLI)](#Command_Line_Interface_CLI)
* [Policy](#Policy)
* [Web UI](#Web_UI)
* [Policy](#Policy)
* [Hadoop Configuration Files](#Hadoop_Configuration_Files)
* [Policy](#Policy)
* [Directory Structure](#Directory_Structure)
* [Policy](#Policy)
* [Java Classpath](#Java_Classpath)
* [Policy](#Policy)
* [Environment variables](#Environment_variables)
* [Policy](#Policy)
* [Build artifacts](#Build_artifacts)
* [Policy](#Policy)
* [Hardware/Software Requirements](#HardwareSoftware_Requirements)
* [Policies](#Policies)
* [References](#References)
Purpose
-------
This document captures the compatibility goals of the Apache Hadoop project. The different types of compatibility between Hadoop releases that affect Hadoop developers, downstream projects, and end-users are enumerated. For each type of compatibility we:
* describe the impact on downstream projects or end-users
* where applicable, call out the policy adopted by the Hadoop developers when incompatible changes are permitted.
Compatibility types
-------------------
### Java API
Hadoop interfaces and classes are annotated to describe the intended audience and stability in order to maintain compatibility with previous releases. See [Hadoop Interface Classification](./InterfaceClassification.html) for details.
* InterfaceAudience: captures the intended audience, possible values are Public (for end users and external projects), LimitedPrivate (for other Hadoop components, and closely related projects like YARN, MapReduce, HBase etc.), and Private (for intra component use).
* InterfaceStability: describes what types of interface changes are permitted. Possible values are Stable, Evolving, Unstable, and Deprecated.
#### Use Cases
* Public-Stable API compatibility is required to ensure end-user programs and downstream projects continue to work without modification.
* LimitedPrivate-Stable API compatibility is required to allow upgrade of individual components across minor releases.
* Private-Stable API compatibility is required for rolling upgrades.
#### Policy
* Public-Stable APIs must be deprecated for at least one major release prior to their removal in a major release.
* LimitedPrivate-Stable APIs can change across major releases, but not within a major release.
* Private-Stable APIs can change across major releases, but not within a major release.
* Classes not annotated are implicitly "Private". Class members not annotated inherit the annotations of the enclosing class.
* Note: APIs generated from the proto files need to be compatible for rolling-upgrades. See the section on wire-compatibility for more details. The compatibility policies for APIs and wire-communication need to go hand-in-hand to address this.
### Semantic compatibility
Apache Hadoop strives to ensure that the behavior of APIs remains consistent over versions, though changes for correctness may result in changes in behavior. Tests and javadocs specify the API's behavior. The community is in the process of specifying some APIs more rigorously, and enhancing test suites to verify compliance with the specification, effectively creating a formal specification for the subset of behaviors that can be easily tested.
#### Policy
The behavior of an API may be changed to fix incorrect behavior; such a change must be accompanied by updating existing buggy tests or adding tests in cases where there were none prior to the change.
### Wire compatibility
Wire compatibility concerns data being transmitted over the wire between Hadoop processes. Hadoop uses Protocol Buffers for most RPC communication. Preserving compatibility requires prohibiting modification as described below. Non-RPC communication should be considered as well, for example using HTTP to transfer an HDFS image as part of snapshotting or transferring MapTask output. The potential communications can be categorized as follows:
* Client-Server: communication between Hadoop clients and servers (e.g., the HDFS client to NameNode protocol, or the YARN client to ResourceManager protocol).
* Client-Server (Admin): It is worth distinguishing a subset of the Client-Server protocols used solely by administrative commands (e.g., the HAAdmin protocol) as these protocols only impact administrators who can tolerate changes that end users (which use general Client-Server protocols) can not.
* Server-Server: communication between servers (e.g., the protocol between the DataNode and NameNode, or NodeManager and ResourceManager)
#### Use Cases
* Client-Server compatibility is required to allow users to continue using the old clients even after upgrading the server (cluster) to a later version (or vice versa). For example, a Hadoop 2.1.0 client talking to a Hadoop 2.3.0 cluster.
* Client-Server compatibility is also required to allow users to upgrade the client before upgrading the server (cluster). For example, a Hadoop 2.4.0 client talking to a Hadoop 2.3.0 cluster. This allows deployment of client-side bug fixes ahead of full cluster upgrades. Note that new cluster features invoked by new client APIs or shell commands will not be usable. YARN applications that attempt to use new APIs (including new fields in data structures) that have not yet deployed to the cluster can expect link exceptions.
* Client-Server compatibility is also required to allow upgrading individual components without upgrading others. For example, upgrade HDFS from version 2.1.0 to 2.2.0 without upgrading MapReduce.
* Server-Server compatibility is required to allow mixed versions within an active cluster so the cluster may be upgraded without downtime in a rolling fashion.
#### Policy
* Both Client-Server and Server-Server compatibility is preserved within a major release. (Different policies for different categories are yet to be considered.)
* Compatibility can be broken only at a major release, though breaking compatibility even at major releases has grave consequences and should be discussed in the Hadoop community.
* Hadoop protocols are defined in .proto (ProtocolBuffers) files. Client-Server protocol and Server-Server protocol .proto files are marked as stable. When a .proto file is marked as stable it means that changes should be made in a compatible fashion as described below:
* The following changes are compatible and are allowed at any time:
* Add an optional field, with the expectation that the code deals with the field missing due to communication with an older version of the code.
* Add a new rpc/method to the service
* Add a new optional request to a Message
* Rename a field
* Rename a .proto file
* Change .proto annotations that effect code generation (e.g. name of java package)
* The following changes are incompatible but can be considered only at a major release
* Change the rpc/method name
* Change the rpc/method parameter type or return type
* Remove an rpc/method
* Change the service name
* Change the name of a Message
* Modify a field type in an incompatible way (as defined recursively)
* Change an optional field to required
* Add or delete a required field
* Delete an optional field as long as the optional field has reasonable defaults to allow deletions
* The following changes are incompatible and hence never allowed
* Change a field id
* Reuse an old field that was previously deleted.
* Field numbers are cheap and changing and reusing is not a good idea.
### Java Binary compatibility for end-user applications i.e. Apache Hadoop ABI
As Apache Hadoop revisions are upgraded, end-users reasonably expect that their applications should continue to work without any modifications. This is fulfilled as a result of supporting API compatibility, Semantic compatibility and Wire compatibility.
However, Apache Hadoop is a very complex, distributed system and services a very wide variety of use-cases. In particular, Apache Hadoop MapReduce is a very, very wide API; in the sense that end-users may make wide-ranging assumptions such as layout of the local disk when their map/reduce tasks are executing, environment variables for their tasks etc. In such cases, it becomes very hard to fully specify, and support, absolute compatibility.
#### Use cases
* Existing MapReduce applications, including jars of existing packaged end-user applications and projects such as Apache Pig, Apache Hive, Cascading etc. should work unmodified when pointed to an upgraded Apache Hadoop cluster within a major release.
* Existing YARN applications, including jars of existing packaged end-user applications and projects such as Apache Tez etc. should work unmodified when pointed to an upgraded Apache Hadoop cluster within a major release.
* Existing applications which transfer data in/out of HDFS, including jars of existing packaged end-user applications and frameworks such as Apache Flume, should work unmodified when pointed to an upgraded Apache Hadoop cluster within a major release.
#### Policy
* Existing MapReduce, YARN & HDFS applications and frameworks should work unmodified within a major release i.e. Apache Hadoop ABI is supported.
* A very minor fraction of applications may be affected by changes to disk layouts etc.; the developer community will strive to minimize these changes and will not make them within a minor version. In more egregious cases, we will consider strongly reverting these breaking changes and invalidating offending releases if necessary.
* In particular for MapReduce applications, the developer community will try its best to provide binary compatibility across major releases e.g. applications using org.apache.hadoop.mapred.
* APIs are supported compatibly across hadoop-1.x and hadoop-2.x. See [Compatibility for MapReduce applications between hadoop-1.x and hadoop-2.x](../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html) for more details.
### REST APIs
REST API compatibility corresponds to both the request (URLs) and responses to each request (content, which may contain other URLs). Hadoop REST APIs are specifically meant for stable use by clients across releases, even major releases. The following are the exposed REST APIs:
* [WebHDFS](../hadoop-hdfs/WebHDFS.html) - Stable
* [ResourceManager](../../hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html)
* [NodeManager](../../hadoop-yarn/hadoop-yarn-site/NodeManagerRest.html)
* [MR Application Master](../../hadoop-yarn/hadoop-yarn-site/MapredAppMasterRest.html)
* [History Server](../../hadoop-yarn/hadoop-yarn-site/HistoryServerRest.html)
#### Policy
The APIs annotated stable in the text above preserve compatibility across at least one major release, and may be deprecated by a newer version of the REST API in a major release.
### Metrics/JMX
While the Metrics API compatibility is governed by Java API compatibility, the actual metrics exposed by Hadoop need to be compatible for users to be able to automate using them (scripts etc.). Adding additional metrics is compatible. Modifying (e.g. changing the unit of measurement) or removing existing metrics breaks compatibility. Similarly, changes to JMX MBean object names also break compatibility.
#### Policy
Metrics should preserve compatibility within the major release.
### File formats & Metadata
User and system level data (including metadata) is stored in files of different formats. Changes to the metadata or the file formats used to store data/metadata can lead to incompatibilities between versions.
#### User-level file formats
Changes to formats that end-users use to store their data can prevent them from accessing the data in later releases, and hence it is highly important to keep those file-formats compatible. One can always add a "new" format improving upon an existing format. Examples of these formats include har, war, SequenceFileFormat etc.
##### Policy
* Non-forward-compatible user-file format changes are restricted to major releases. When user-file formats change, new releases are expected to read existing formats, but may write data in formats incompatible with prior releases. Also, the community shall prefer to create a new format that programs must opt in to instead of making incompatible changes to existing formats.
#### System-internal file formats
Hadoop internal data is also stored in files and again changing these formats can lead to incompatibilities. While such changes are not as devastating as the user-level file formats, a policy on when the compatibility can be broken is important.
##### MapReduce
MapReduce uses formats like I-File to store MapReduce-specific data.
##### Policy
MapReduce-internal formats like IFile maintain compatibility within a major release. Changes to these formats can cause in-flight jobs to fail and hence we should ensure newer clients can fetch shuffle-data from old servers in a compatible manner.
##### HDFS Metadata
HDFS persists metadata (the image and edit logs) in a particular format. Incompatible changes to either the format or the metadata prevent subsequent releases from reading older metadata. Such incompatible changes might require an HDFS "upgrade" to convert the metadata to make it accessible. Some changes can require more than one such "upgrade".
Depending on the degree of incompatibility in the changes, the following potential scenarios can arise:
* Automatic: The image upgrades automatically, no need for an explicit "upgrade".
* Direct: The image is upgradable, but might require one explicit release "upgrade".
* Indirect: The image is upgradable, but might require upgrading to intermediate release(s) first.
* Not upgradeable: The image is not upgradeable.
##### Policy
* A release upgrade must allow a cluster to roll back to the older version and its older disk format. The rollback needs to restore the original data, but is not required to restore the updated data.
* HDFS metadata changes must be upgradeable via any of the upgrade paths - automatic, direct or indirect.
* More detailed policies based on the kind of upgrade are yet to be considered.
### Command Line Interface (CLI)
The Hadoop command line programs may be used either directly via the system shell or via shell scripts. Changing the path of a command, removing or renaming command line options, the order of arguments, or the command return code and output break compatibility and may adversely affect users.
#### Policy
CLI commands are to be deprecated (warning when used) for one major release before they are removed or incompatibly modified in a subsequent major release.
### Web UI
Changes to the Web UI, particularly the content and layout of web pages, could potentially interfere with attempts to screen scrape the web pages for information.
#### Policy
Web pages are not meant to be scraped and hence incompatible changes to them are allowed at any time. Users are expected to use REST APIs to get any information.
### Hadoop Configuration Files
Users use (1) Hadoop-defined properties to configure and provide hints to Hadoop and (2) custom properties to pass information to jobs. Hence, compatibility of config properties is two-fold:
* Modifying key-names, units of values, and default values of Hadoop-defined properties.
* Custom configuration property keys should not conflict with the namespace of Hadoop-defined properties. Typically, users should avoid using prefixes used by Hadoop: hadoop, io, ipc, fs, net, file, ftp, s3, kfs, ha, file, dfs, mapred, mapreduce, yarn.
#### Policy
* Hadoop-defined properties are to be deprecated at least for one major release before being removed. Modifying units for existing properties is not allowed.
* The default values of Hadoop-defined properties can be changed across minor/major releases, but will remain the same across point releases within a minor release.
* Currently, there is NO explicit policy regarding when new prefixes can be added/removed, and the list of prefixes to be avoided for custom configuration properties. However, as noted above, users should avoid using prefixes used by Hadoop: hadoop, io, ipc, fs, net, file, ftp, s3, kfs, ha, file, dfs, mapred, mapreduce, yarn.
### Directory Structure
Source code, artifacts (source and tests), user logs, configuration files, output and job history are all stored on disk, either on the local file system or in HDFS. Changing the directory structure of these user-accessible files breaks compatibility, even in cases where the original path is preserved via symbolic links (for example, if the path is accessed by a servlet that is configured to not follow symbolic links).
#### Policy
* The layout of source code and build artifacts can change anytime, particularly so across major versions. Within a major version, the developers will attempt (no guarantees) to preserve the directory structure; however, individual files can be added/moved/deleted. The best way to ensure patches stay in sync with the code is to get them committed to the Apache source tree.
* The directory structure of configuration files, user logs, and job history will be preserved across minor and point releases within a major release.
### Java Classpath
User applications built against Hadoop might add all Hadoop jars (including Hadoop's library dependencies) to the application's classpath. Adding new dependencies or updating the version of existing dependencies may interfere with those in applications' classpaths.
#### Policy
Currently, there is NO policy on when Hadoop's dependencies can change.
### Environment variables
Users and related projects often utilize the exported environment variables (eg HADOOP\_CONF\_DIR), therefore removing or renaming environment variables is an incompatible change.
#### Policy
Currently, there is NO policy on when the environment variables can change. Developers try to limit changes to major releases.
### Build artifacts
Hadoop uses maven for project management and changing the artifacts can affect existing user workflows.
#### Policy
* Test artifacts: The test jars generated are strictly for internal use and are not expected to be used outside of Hadoop, similar to APIs annotated @Private, @Unstable.
* Built artifacts: The hadoop-client artifact (maven groupId:artifactId) stays compatible within a major release, while the other artifacts can change in incompatible ways.
### Hardware/Software Requirements
To keep up with the latest advances in hardware, operating systems, JVMs, and other software, new Hadoop releases or some of their features might require higher versions of the same. For a specific environment, upgrading Hadoop might require upgrading other dependent software components.
#### Policies
* Hardware
* Architecture: The community has no plans to restrict Hadoop to specific architectures, but can have family-specific optimizations.
* Minimum resources: While there are no guarantees on the minimum resources required by Hadoop daemons, the community attempts to not increase requirements within a minor release.
* Operating Systems: The community will attempt to maintain the same OS requirements (OS kernel versions) within a minor release. Currently GNU/Linux and Microsoft Windows are the OSes officially supported by the community while Apache Hadoop is known to work reasonably well on other OSes such as Apple MacOSX, Solaris etc.
* The JVM requirements will not change across point releases within the same minor release except if the JVM version under question becomes unsupported. Minor/major releases might require later versions of JVM for some/all of the supported operating systems.
* Other software: The community tries to maintain the minimum versions of additional software required by Hadoop. For example, ssh, kerberos etc.
References
----------
Here are some relevant JIRAs and pages related to the topic:
* The evolution of this document - [HADOOP-9517](https://issues.apache.org/jira/browse/HADOOP-9517)
* Binary compatibility for MapReduce end-user applications between hadoop-1.x and hadoop-2.x - [MapReduce Compatibility between hadoop-1.x and hadoop-2.x](../../hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduce_Compatibility_Hadoop1_Hadoop2.html)
* Annotations for interfaces as per interface classification schedule - [HADOOP-7391](https://issues.apache.org/jira/browse/HADOOP-7391) [Hadoop Interface Classification](./InterfaceClassification.html)
* Compatibility for Hadoop 1.x releases - [HADOOP-5071](https://issues.apache.org/jira/browse/HADOOP-5071)
* The [Hadoop Roadmap](http://wiki.apache.org/hadoop/Roadmap) page that captures other release policies

@ -0,0 +1,288 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Deprecated Properties
=====================
The following table lists the configuration property names that are deprecated in this version of Hadoop, and their replacements.
| **Deprecated property name** | **New property name** |
|:---- |:---- |
| create.empty.dir.if.nonexist | mapreduce.jobcontrol.createdir.ifnotexist |
| dfs.access.time.precision | dfs.namenode.accesstime.precision |
| dfs.backup.address | dfs.namenode.backup.address |
| dfs.backup.http.address | dfs.namenode.backup.http-address |
| dfs.balance.bandwidthPerSec | dfs.datanode.balance.bandwidthPerSec |
| dfs.block.size | dfs.blocksize |
| dfs.data.dir | dfs.datanode.data.dir |
| dfs.datanode.max.xcievers | dfs.datanode.max.transfer.threads |
| dfs.df.interval | fs.df.interval |
| dfs.federation.nameservice.id | dfs.nameservice.id |
| dfs.federation.nameservices | dfs.nameservices |
| dfs.http.address | dfs.namenode.http-address |
| dfs.https.address | dfs.namenode.https-address |
| dfs.https.client.keystore.resource | dfs.client.https.keystore.resource |
| dfs.https.need.client.auth | dfs.client.https.need-auth |
| dfs.max.objects | dfs.namenode.max.objects |
| dfs.max-repl-streams | dfs.namenode.replication.max-streams |
| dfs.name.dir | dfs.namenode.name.dir |
| dfs.name.dir.restore | dfs.namenode.name.dir.restore |
| dfs.name.edits.dir | dfs.namenode.edits.dir |
| dfs.permissions | dfs.permissions.enabled |
| dfs.permissions.supergroup | dfs.permissions.superusergroup |
| dfs.read.prefetch.size | dfs.client.read.prefetch.size |
| dfs.replication.considerLoad | dfs.namenode.replication.considerLoad |
| dfs.replication.interval | dfs.namenode.replication.interval |
| dfs.replication.min | dfs.namenode.replication.min |
| dfs.replication.pending.timeout.sec | dfs.namenode.replication.pending.timeout-sec |
| dfs.safemode.extension | dfs.namenode.safemode.extension |
| dfs.safemode.threshold.pct | dfs.namenode.safemode.threshold-pct |
| dfs.secondary.http.address | dfs.namenode.secondary.http-address |
| dfs.socket.timeout | dfs.client.socket-timeout |
| dfs.umaskmode | fs.permissions.umask-mode |
| dfs.write.packet.size | dfs.client-write-packet-size |
| fs.checkpoint.dir | dfs.namenode.checkpoint.dir |
| fs.checkpoint.edits.dir | dfs.namenode.checkpoint.edits.dir |
| fs.checkpoint.period | dfs.namenode.checkpoint.period |
| fs.default.name | fs.defaultFS |
| hadoop.configured.node.mapping | net.topology.configured.node.mapping |
| hadoop.job.history.location | mapreduce.jobtracker.jobhistory.location |
| hadoop.native.lib | io.native.lib.available |
| hadoop.net.static.resolutions | mapreduce.tasktracker.net.static.resolutions |
| hadoop.pipes.command-file.keep | mapreduce.pipes.commandfile.preserve |
| hadoop.pipes.executable.interpretor | mapreduce.pipes.executable.interpretor |
| hadoop.pipes.executable | mapreduce.pipes.executable |
| hadoop.pipes.java.mapper | mapreduce.pipes.isjavamapper |
| hadoop.pipes.java.recordreader | mapreduce.pipes.isjavarecordreader |
| hadoop.pipes.java.recordwriter | mapreduce.pipes.isjavarecordwriter |
| hadoop.pipes.java.reducer | mapreduce.pipes.isjavareducer |
| hadoop.pipes.partitioner | mapreduce.pipes.partitioner |
| heartbeat.recheck.interval | dfs.namenode.heartbeat.recheck-interval |
| io.bytes.per.checksum | dfs.bytes-per-checksum |
| io.sort.factor | mapreduce.task.io.sort.factor |
| io.sort.mb | mapreduce.task.io.sort.mb |
| io.sort.spill.percent | mapreduce.map.sort.spill.percent |
| jobclient.completion.poll.interval | mapreduce.client.completion.pollinterval |
| jobclient.output.filter | mapreduce.client.output.filter |
| jobclient.progress.monitor.poll.interval | mapreduce.client.progressmonitor.pollinterval |
| job.end.notification.url | mapreduce.job.end-notification.url |
| job.end.retry.attempts | mapreduce.job.end-notification.retry.attempts |
| job.end.retry.interval | mapreduce.job.end-notification.retry.interval |
| job.local.dir | mapreduce.job.local.dir |
| keep.failed.task.files | mapreduce.task.files.preserve.failedtasks |
| keep.task.files.pattern | mapreduce.task.files.preserve.filepattern |
| key.value.separator.in.input.line | mapreduce.input.keyvaluelinerecordreader.key.value.separator |
| local.cache.size | mapreduce.tasktracker.cache.local.size |
| map.input.file | mapreduce.map.input.file |
| map.input.length | mapreduce.map.input.length |
| map.input.start | mapreduce.map.input.start |
| map.output.key.field.separator | mapreduce.map.output.key.field.separator |
| map.output.key.value.fields.spec | mapreduce.fieldsel.map.output.key.value.fields.spec |
| mapred.acls.enabled | mapreduce.cluster.acls.enabled |
| mapred.binary.partitioner.left.offset | mapreduce.partition.binarypartitioner.left.offset |
| mapred.binary.partitioner.right.offset | mapreduce.partition.binarypartitioner.right.offset |
| mapred.cache.archives | mapreduce.job.cache.archives |
| mapred.cache.archives.timestamps | mapreduce.job.cache.archives.timestamps |
| mapred.cache.files | mapreduce.job.cache.files |
| mapred.cache.files.timestamps | mapreduce.job.cache.files.timestamps |
| mapred.cache.localArchives | mapreduce.job.cache.local.archives |
| mapred.cache.localFiles | mapreduce.job.cache.local.files |
| mapred.child.tmp | mapreduce.task.tmp.dir |
| mapred.cluster.average.blacklist.threshold | mapreduce.jobtracker.blacklist.average.threshold |
| mapred.cluster.map.memory.mb | mapreduce.cluster.mapmemory.mb |
| mapred.cluster.max.map.memory.mb | mapreduce.jobtracker.maxmapmemory.mb |
| mapred.cluster.max.reduce.memory.mb | mapreduce.jobtracker.maxreducememory.mb |
| mapred.cluster.reduce.memory.mb | mapreduce.cluster.reducememory.mb |
| mapred.committer.job.setup.cleanup.needed | mapreduce.job.committer.setup.cleanup.needed |
| mapred.compress.map.output | mapreduce.map.output.compress |
| mapred.data.field.separator | mapreduce.fieldsel.data.field.separator |
| mapred.debug.out.lines | mapreduce.task.debugout.lines |
| mapred.healthChecker.interval | mapreduce.tasktracker.healthchecker.interval |
| mapred.healthChecker.script.args | mapreduce.tasktracker.healthchecker.script.args |
| mapred.healthChecker.script.path | mapreduce.tasktracker.healthchecker.script.path |
| mapred.healthChecker.script.timeout | mapreduce.tasktracker.healthchecker.script.timeout |
| mapred.heartbeats.in.second | mapreduce.jobtracker.heartbeats.in.second |
| mapred.hosts.exclude | mapreduce.jobtracker.hosts.exclude.filename |
| mapred.hosts | mapreduce.jobtracker.hosts.filename |
| mapred.inmem.merge.threshold | mapreduce.reduce.merge.inmem.threshold |
| mapred.input.dir.formats | mapreduce.input.multipleinputs.dir.formats |
| mapred.input.dir.mappers | mapreduce.input.multipleinputs.dir.mappers |
| mapred.input.dir | mapreduce.input.fileinputformat.inputdir |
| mapred.input.pathFilter.class | mapreduce.input.pathFilter.class |
| mapred.jar | mapreduce.job.jar |
| mapred.job.classpath.archives | mapreduce.job.classpath.archives |
| mapred.job.classpath.files | mapreduce.job.classpath.files |
| mapred.job.id | mapreduce.job.id |
| mapred.jobinit.threads | mapreduce.jobtracker.jobinit.threads |
| mapred.job.map.memory.mb | mapreduce.map.memory.mb |
| mapred.job.name | mapreduce.job.name |
| mapred.job.priority | mapreduce.job.priority |
| mapred.job.queue.name | mapreduce.job.queuename |
| mapred.job.reduce.input.buffer.percent | mapreduce.reduce.input.buffer.percent |
| mapred.job.reduce.markreset.buffer.percent | mapreduce.reduce.markreset.buffer.percent |
| mapred.job.reduce.memory.mb | mapreduce.reduce.memory.mb |
| mapred.job.reduce.total.mem.bytes | mapreduce.reduce.memory.totalbytes |
| mapred.job.reuse.jvm.num.tasks | mapreduce.job.jvm.numtasks |
| mapred.job.shuffle.input.buffer.percent | mapreduce.reduce.shuffle.input.buffer.percent |
| mapred.job.shuffle.merge.percent | mapreduce.reduce.shuffle.merge.percent |
| mapred.job.tracker.handler.count | mapreduce.jobtracker.handler.count |
| mapred.job.tracker.history.completed.location | mapreduce.jobtracker.jobhistory.completed.location |
| mapred.job.tracker.http.address | mapreduce.jobtracker.http.address |
| mapred.jobtracker.instrumentation | mapreduce.jobtracker.instrumentation |
| mapred.jobtracker.job.history.block.size | mapreduce.jobtracker.jobhistory.block.size |
| mapred.job.tracker.jobhistory.lru.cache.size | mapreduce.jobtracker.jobhistory.lru.cache.size |
| mapred.job.tracker | mapreduce.jobtracker.address |
| mapred.jobtracker.maxtasks.per.job | mapreduce.jobtracker.maxtasks.perjob |
| mapred.job.tracker.persist.jobstatus.active | mapreduce.jobtracker.persist.jobstatus.active |
| mapred.job.tracker.persist.jobstatus.dir | mapreduce.jobtracker.persist.jobstatus.dir |
| mapred.job.tracker.persist.jobstatus.hours | mapreduce.jobtracker.persist.jobstatus.hours |
| mapred.jobtracker.restart.recover | mapreduce.jobtracker.restart.recover |
| mapred.job.tracker.retiredjobs.cache.size | mapreduce.jobtracker.retiredjobs.cache.size |
| mapred.job.tracker.retire.jobs | mapreduce.jobtracker.retirejobs |
| mapred.jobtracker.taskalloc.capacitypad | mapreduce.jobtracker.taskscheduler.taskalloc.capacitypad |
| mapred.jobtracker.taskScheduler | mapreduce.jobtracker.taskscheduler |
| mapred.jobtracker.taskScheduler.maxRunningTasksPerJob | mapreduce.jobtracker.taskscheduler.maxrunningtasks.perjob |
| mapred.join.expr | mapreduce.join.expr |
| mapred.join.keycomparator | mapreduce.join.keycomparator |
| mapred.lazy.output.format | mapreduce.output.lazyoutputformat.outputformat |
| mapred.line.input.format.linespermap | mapreduce.input.lineinputformat.linespermap |
| mapred.linerecordreader.maxlength | mapreduce.input.linerecordreader.line.maxlength |
| mapred.local.dir | mapreduce.cluster.local.dir |
| mapred.local.dir.minspacekill | mapreduce.tasktracker.local.dir.minspacekill |
| mapred.local.dir.minspacestart | mapreduce.tasktracker.local.dir.minspacestart |
| mapred.map.child.env | mapreduce.map.env |
| mapred.map.child.java.opts | mapreduce.map.java.opts |
| mapred.map.child.log.level | mapreduce.map.log.level |
| mapred.map.max.attempts | mapreduce.map.maxattempts |
| mapred.map.output.compression.codec | mapreduce.map.output.compress.codec |
| mapred.mapoutput.key.class | mapreduce.map.output.key.class |
| mapred.mapoutput.value.class | mapreduce.map.output.value.class |
| mapred.mapper.regex.group | mapreduce.mapper.regexmapper..group |
| mapred.mapper.regex | mapreduce.mapper.regex |
| mapred.map.task.debug.script | mapreduce.map.debug.script |
| mapred.map.tasks | mapreduce.job.maps |
| mapred.map.tasks.speculative.execution | mapreduce.map.speculative |
| mapred.max.map.failures.percent | mapreduce.map.failures.maxpercent |
| mapred.max.reduce.failures.percent | mapreduce.reduce.failures.maxpercent |
| mapred.max.split.size | mapreduce.input.fileinputformat.split.maxsize |
| mapred.max.tracker.blacklists | mapreduce.jobtracker.tasktracker.maxblacklists |
| mapred.max.tracker.failures | mapreduce.job.maxtaskfailures.per.tracker |
| mapred.merge.recordsBeforeProgress | mapreduce.task.merge.progress.records |
| mapred.min.split.size | mapreduce.input.fileinputformat.split.minsize |
| mapred.min.split.size.per.node | mapreduce.input.fileinputformat.split.minsize.per.node |
| mapred.min.split.size.per.rack | mapreduce.input.fileinputformat.split.minsize.per.rack |
| mapred.output.compression.codec | mapreduce.output.fileoutputformat.compress.codec |
| mapred.output.compression.type | mapreduce.output.fileoutputformat.compress.type |
| mapred.output.compress | mapreduce.output.fileoutputformat.compress |
| mapred.output.dir | mapreduce.output.fileoutputformat.outputdir |
| mapred.output.key.class | mapreduce.job.output.key.class |
| mapred.output.key.comparator.class | mapreduce.job.output.key.comparator.class |
| mapred.output.value.class | mapreduce.job.output.value.class |
| mapred.output.value.groupfn.class | mapreduce.job.output.group.comparator.class |
| mapred.permissions.supergroup | mapreduce.cluster.permissions.supergroup |
| mapred.pipes.user.inputformat | mapreduce.pipes.inputformat |
| mapred.reduce.child.env | mapreduce.reduce.env |
| mapred.reduce.child.java.opts | mapreduce.reduce.java.opts |
| mapred.reduce.child.log.level | mapreduce.reduce.log.level |
| mapred.reduce.max.attempts | mapreduce.reduce.maxattempts |
| mapred.reduce.parallel.copies | mapreduce.reduce.shuffle.parallelcopies |
| mapred.reduce.slowstart.completed.maps | mapreduce.job.reduce.slowstart.completedmaps |
| mapred.reduce.task.debug.script | mapreduce.reduce.debug.script |
| mapred.reduce.tasks | mapreduce.job.reduces |
| mapred.reduce.tasks.speculative.execution | mapreduce.reduce.speculative |
| mapred.seqbinary.output.key.class | mapreduce.output.seqbinaryoutputformat.key.class |
| mapred.seqbinary.output.value.class | mapreduce.output.seqbinaryoutputformat.value.class |
| mapred.shuffle.connect.timeout | mapreduce.reduce.shuffle.connect.timeout |
| mapred.shuffle.read.timeout | mapreduce.reduce.shuffle.read.timeout |
| mapred.skip.attempts.to.start.skipping | mapreduce.task.skip.start.attempts |
| mapred.skip.map.auto.incr.proc.count | mapreduce.map.skip.proc-count.auto-incr |
| mapred.skip.map.max.skip.records | mapreduce.map.skip.maxrecords |
| mapred.skip.on | mapreduce.job.skiprecords |
| mapred.skip.out.dir | mapreduce.job.skip.outdir |
| mapred.skip.reduce.auto.incr.proc.count | mapreduce.reduce.skip.proc-count.auto-incr |
| mapred.skip.reduce.max.skip.groups | mapreduce.reduce.skip.maxgroups |
| mapred.speculative.execution.slowNodeThreshold | mapreduce.job.speculative.slownodethreshold |
| mapred.speculative.execution.slowTaskThreshold | mapreduce.job.speculative.slowtaskthreshold |
| mapred.speculative.execution.speculativeCap | mapreduce.job.speculative.speculativecap |
| mapred.submit.replication | mapreduce.client.submit.file.replication |
| mapred.system.dir | mapreduce.jobtracker.system.dir |
| mapred.task.cache.levels | mapreduce.jobtracker.taskcache.levels |
| mapred.task.id | mapreduce.task.attempt.id |
| mapred.task.is.map | mapreduce.task.ismap |
| mapred.task.partition | mapreduce.task.partition |
| mapred.task.profile | mapreduce.task.profile |
| mapred.task.profile.maps | mapreduce.task.profile.maps |
| mapred.task.profile.params | mapreduce.task.profile.params |
| mapred.task.profile.reduces | mapreduce.task.profile.reduces |
| mapred.task.timeout | mapreduce.task.timeout |
| mapred.tasktracker.dns.interface | mapreduce.tasktracker.dns.interface |
| mapred.tasktracker.dns.nameserver | mapreduce.tasktracker.dns.nameserver |
| mapred.tasktracker.events.batchsize | mapreduce.tasktracker.events.batchsize |
| mapred.tasktracker.expiry.interval | mapreduce.jobtracker.expire.trackers.interval |
| mapred.task.tracker.http.address | mapreduce.tasktracker.http.address |
| mapred.tasktracker.indexcache.mb | mapreduce.tasktracker.indexcache.mb |
| mapred.tasktracker.instrumentation | mapreduce.tasktracker.instrumentation |
| mapred.tasktracker.map.tasks.maximum | mapreduce.tasktracker.map.tasks.maximum |
| mapred.tasktracker.memory\_calculator\_plugin | mapreduce.tasktracker.resourcecalculatorplugin |
| mapred.tasktracker.memorycalculatorplugin | mapreduce.tasktracker.resourcecalculatorplugin |
| mapred.tasktracker.reduce.tasks.maximum | mapreduce.tasktracker.reduce.tasks.maximum |
| mapred.task.tracker.report.address | mapreduce.tasktracker.report.address |
| mapred.task.tracker.task-controller | mapreduce.tasktracker.taskcontroller |
| mapred.tasktracker.taskmemorymanager.monitoring-interval | mapreduce.tasktracker.taskmemorymanager.monitoringinterval |
| mapred.tasktracker.tasks.sleeptime-before-sigkill | mapreduce.tasktracker.tasks.sleeptimebeforesigkill |
| mapred.temp.dir | mapreduce.cluster.temp.dir |
| mapred.text.key.comparator.options | mapreduce.partition.keycomparator.options |
| mapred.text.key.partitioner.options | mapreduce.partition.keypartitioner.options |
| mapred.textoutputformat.separator | mapreduce.output.textoutputformat.separator |
| mapred.tip.id | mapreduce.task.id |
| mapreduce.combine.class | mapreduce.job.combine.class |
| mapreduce.inputformat.class | mapreduce.job.inputformat.class |
| mapreduce.job.counters.limit | mapreduce.job.counters.max |
| mapreduce.jobtracker.permissions.supergroup | mapreduce.cluster.permissions.supergroup |
| mapreduce.map.class | mapreduce.job.map.class |
| mapreduce.outputformat.class | mapreduce.job.outputformat.class |
| mapreduce.partitioner.class | mapreduce.job.partitioner.class |
| mapreduce.reduce.class | mapreduce.job.reduce.class |
| mapred.used.genericoptionsparser | mapreduce.client.genericoptionsparser.used |
| mapred.userlog.limit.kb | mapreduce.task.userlog.limit.kb |
| mapred.userlog.retain.hours | mapreduce.job.userlog.retain.hours |
| mapred.working.dir | mapreduce.job.working.dir |
| mapred.work.output.dir | mapreduce.task.output.dir |
| min.num.spills.for.combine | mapreduce.map.combine.minspills |
| reduce.output.key.value.fields.spec | mapreduce.fieldsel.reduce.output.key.value.fields.spec |
| security.job.submission.protocol.acl | security.job.client.protocol.acl |
| security.task.umbilical.protocol.acl | security.job.task.protocol.acl |
| sequencefile.filter.class | mapreduce.input.sequencefileinputfilter.class |
| sequencefile.filter.frequency | mapreduce.input.sequencefileinputfilter.frequency |
| sequencefile.filter.regex | mapreduce.input.sequencefileinputfilter.regex |
| session.id | dfs.metrics.session-id |
| slave.host.name | dfs.datanode.hostname |
| slave.host.name | mapreduce.tasktracker.host.name |
| tasktracker.contention.tracking | mapreduce.tasktracker.contention.tracking |
| tasktracker.http.threads | mapreduce.tasktracker.http.threads |
| topology.node.switch.mapping.impl | net.topology.node.switch.mapping.impl |
| topology.script.file.name | net.topology.script.file.name |
| topology.script.number.args | net.topology.script.number.args |
| user.name | mapreduce.job.user.name |
| webinterface.private.actions | mapreduce.jobtracker.webinterface.trusted |
| yarn.app.mapreduce.yarn.app.mapreduce.client-am.ipc.max-retries-on-timeouts | yarn.app.mapreduce.client-am.ipc.max-retries-on-timeouts |
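Deprecated names remain usable at runtime: the configuration layer translates them to their replacements and typically logs a deprecation warning. A minimal sketch (the value and paths are hypothetical):

```bash
# dfs.block.size is translated to dfs.blocksize before the write happens
hadoop fs -Ddfs.block.size=134217728 -put localfile /user/hadoop/hadoopfile
```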
The following table lists additional changes to some configuration properties:
| **Deprecated property name** | **New property name** |
|:---- |:---- |
| mapred.create.symlink | NONE - symlinking is always on |
| mapreduce.job.cache.symlink.create | NONE - symlinking is always on |

@ -0,0 +1,710 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
* [Overview](#Overview)
* [appendToFile](#appendToFile)
* [cat](#cat)
* [checksum](#checksum)
* [chgrp](#chgrp)
* [chmod](#chmod)
* [chown](#chown)
* [copyFromLocal](#copyFromLocal)
* [copyToLocal](#copyToLocal)
* [count](#count)
* [cp](#cp)
* [createSnapshot](#createSnapshot)
* [deleteSnapshot](#deleteSnapshot)
* [df](#df)
* [du](#du)
* [dus](#dus)
* [expunge](#expunge)
* [find](#find)
* [get](#get)
* [getfacl](#getfacl)
* [getfattr](#getfattr)
* [getmerge](#getmerge)
* [help](#help)
* [ls](#ls)
* [lsr](#lsr)
* [mkdir](#mkdir)
* [moveFromLocal](#moveFromLocal)
* [moveToLocal](#moveToLocal)
* [mv](#mv)
* [put](#put)
* [renameSnapshot](#renameSnapshot)
* [rm](#rm)
* [rmdir](#rmdir)
* [rmr](#rmr)
* [setfacl](#setfacl)
* [setfattr](#setfattr)
* [setrep](#setrep)
* [stat](#stat)
* [tail](#tail)
* [test](#test)
* [text](#text)
* [touchz](#touchz)
* [truncate](#truncate)
* [usage](#usage)
Overview
========
The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports, such as Local FS, HFTP FS, S3 FS, and others. The FS shell is invoked by:
bin/hadoop fs <args>
All FS shell commands take path URIs as arguments. The URI format is `scheme://authority/path`. For HDFS the scheme is `hdfs`, and for the Local FS the scheme is `file`. The scheme and authority are optional. If not specified, the default scheme specified in the configuration is used. An HDFS file or directory such as /parent/child can be specified as `hdfs://namenodehost/parent/child` or simply as `/parent/child` (given that your configuration is set to point to `hdfs://namenodehost`).
Most of the commands in FS shell behave like corresponding Unix commands. Differences are described with each of the commands. Error information is sent to stderr and the output is sent to stdout.
If HDFS is being used, `hdfs dfs` is a synonym.
See the [Commands Manual](./CommandsManual.html) for generic shell options.
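For example (a sketch; the host and paths are placeholders), the following list a directory relative to the default filesystem, the same directory via a fully qualified URI, and a local-filesystem path:

```bash
# Relative to the default filesystem configured in fs.defaultFS
hadoop fs -ls /user/hadoop

# Fully qualified HDFS URI (namenodehost is a placeholder)
hadoop fs -ls hdfs://namenodehost/user/hadoop

# Local filesystem via the file scheme
hadoop fs -ls file:///tmp
```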
appendToFile
------------
Usage: `hadoop fs -appendToFile <localsrc> ... <dst> `
Append single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and appends to destination file system.
* `hadoop fs -appendToFile localfile /user/hadoop/hadoopfile`
* `hadoop fs -appendToFile localfile1 localfile2 /user/hadoop/hadoopfile`
* `hadoop fs -appendToFile localfile hdfs://nn.example.com/hadoop/hadoopfile`
* `hadoop fs -appendToFile - hdfs://nn.example.com/hadoop/hadoopfile` Reads the input from stdin.
Exit Code:
Returns 0 on success and 1 on error.
cat
---
Usage: `hadoop fs -cat URI [URI ...]`
Copies source paths to stdout.
Example:
* `hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2`
* `hadoop fs -cat file:///file3 /user/hadoop/file4`
Exit Code:
Returns 0 on success and -1 on error.
checksum
--------
Usage: `hadoop fs -checksum URI`
Returns the checksum information of a file.
Example:
* `hadoop fs -checksum hdfs://nn1.example.com/file1`
* `hadoop fs -checksum file:///etc/hosts`
chgrp
-----
Usage: `hadoop fs -chgrp [-R] GROUP URI [URI ...]`
Change group association of files. The user must be the owner of files, or else a super-user. Additional information is in the [Permissions Guide](../hadoop-hdfs/HdfsPermissionsGuide.html).
Options
* The -R option will make the change recursively through the directory structure.
chmod
-----
Usage: `hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]`
Change the permissions of files. With -R, make the change recursively through the directory structure. The user must be the owner of the file, or else a super-user. Additional information is in the [Permissions Guide](../hadoop-hdfs/HdfsPermissionsGuide.html).
Options
* The -R option will make the change recursively through the directory structure.
chown
-----
Usage: `hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI ]`
Change the owner of files. The user must be a super-user. Additional information is in the [Permissions Guide](../hadoop-hdfs/HdfsPermissionsGuide.html).
Options
* The -R option will make the change recursively through the directory structure.
copyFromLocal
-------------
Usage: `hadoop fs -copyFromLocal <localsrc> URI`
Similar to the put command, except that the source is restricted to a local file reference.
Options:
* The -f option will overwrite the destination if it already exists.
copyToLocal
-----------
Usage: `hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst> `
Similar to the get command, except that the destination is restricted to a local file reference.
count
-----
Usage: `hadoop fs -count [-q] [-h] [-v] <paths> `
Count the number of directories, files and bytes under the paths that match the specified file pattern. The output columns with -count are: DIR\_COUNT, FILE\_COUNT, CONTENT\_SIZE, PATHNAME
The output columns with -count -q are: QUOTA, REMAINING\_QUOTA, SPACE\_QUOTA, REMAINING\_SPACE\_QUOTA, DIR\_COUNT, FILE\_COUNT, CONTENT\_SIZE, PATHNAME
The -h option shows sizes in human readable format.
The -v option displays a header line.
Example:
* `hadoop fs -count hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2`
* `hadoop fs -count -q hdfs://nn1.example.com/file1`
* `hadoop fs -count -q -h hdfs://nn1.example.com/file1`
* `hdfs dfs -count -q -h -v hdfs://nn1.example.com/file1`
Exit Code:
Returns 0 on success and -1 on error.
cp
----
Usage: `hadoop fs -cp [-f] [-p | -p[topax]] URI [URI ...] <dest> `
Copy files from source to destination. This command allows multiple sources as well in which case the destination must be a directory.
'raw.\*' namespace extended attributes are preserved if (1) the source and destination filesystems support them (HDFS only), and (2) all source and destination pathnames are in the /.reserved/raw hierarchy. Determination of whether raw.\* namespace xattrs are preserved is independent of the -p (preserve) flag.
Options:
* The -f option will overwrite the destination if it already exists.
* The -p option will preserve file attributes [topx] (timestamps, ownership, permission, ACL, XAttr). If -p is specified with no *arg*, then preserves timestamps, ownership, permission. If -pa is specified, then preserves permission also because ACL is a super-set of permission. Determination of whether raw namespace extended attributes are preserved is independent of the -p flag.
Example:
* `hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2`
* `hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir`
Exit Code:
Returns 0 on success and -1 on error.
createSnapshot
--------------
See [HDFS Snapshots Guide](../hadoop-hdfs/HdfsSnapshots.html).
deleteSnapshot
--------------
See [HDFS Snapshots Guide](../hadoop-hdfs/HdfsSnapshots.html).
df
----
Usage: `hadoop fs -df [-h] URI [URI ...]`
Displays free space.
Options:
* The -h option will format file sizes in a "human-readable" fashion (e.g. 64.0m instead of 67108864)
Example:
* `hadoop fs -df /user/hadoop/dir1`
du
----
Usage: `hadoop fs -du [-s] [-h] URI [URI ...]`
Displays sizes of files and directories contained in the given directory, or the length of a file in case it's just a file.
Options:
* The -s option will result in an aggregate summary of file lengths being displayed, rather than the individual files.
* The -h option will format file sizes in a "human-readable" fashion (e.g. 64.0m instead of 67108864)
Example:
* `hadoop fs -du /user/hadoop/dir1 /user/hadoop/file1 hdfs://nn.example.com/user/hadoop/dir1`
Exit Code: Returns 0 on success and -1 on error.
dus
---
Usage: `hadoop fs -dus <args> `
Displays a summary of file lengths.
**Note:** This command is deprecated. Instead use `hadoop fs -du -s`.
expunge
-------
Usage: `hadoop fs -expunge`
Empty the Trash. Refer to the [HDFS Architecture Guide](../hadoop-hdfs/HdfsDesign.html) for more information on the Trash feature.
find
----
Usage: `hadoop fs -find <path> ... <expression> ... `
Finds all files that match the specified expression and applies selected actions to them. If no *path* is specified, it defaults to the current working directory. If no expression is specified, it defaults to -print.
The following primary expressions are recognised:
* -name pattern<br />-iname pattern
Evaluates as true if the basename of the file matches the pattern using standard file system globbing. If -iname is used then the match is case insensitive.
* -print<br />-print0
Always evaluates to true. Causes the current pathname to be written to standard output. If the -print0 expression is used then an ASCII NULL character is appended.
The following operators are recognised:
* expression -a expression<br />expression -and expression<br />expression expression
Logical AND operator for joining two expressions. Returns true if both child expressions return true. Implied by the juxtaposition of two expressions and so does not need to be explicitly specified. The second expression will not be applied if the first fails.
Example:
`hadoop fs -find / -name test -print`
Exit Code:
Returns 0 on success and -1 on error.
get
---
Usage: `hadoop fs -get [-ignorecrc] [-crc] <src> <localdst> `
Copy files to the local file system. Files that fail the CRC check may be copied with the -ignorecrc option. Files and CRCs may be copied using the -crc option.
Example:
* `hadoop fs -get /user/hadoop/file localfile`
* `hadoop fs -get hdfs://nn.example.com/user/hadoop/file localfile`
Exit Code:
Returns 0 on success and -1 on error.
getfacl
-------
Usage: `hadoop fs -getfacl [-R] <path> `
Displays the Access Control Lists (ACLs) of files and directories. If a directory has a default ACL, then getfacl also displays the default ACL.
Options:
* -R: List the ACLs of all files and directories recursively.
* *path*: File or directory to list.
Examples:
* `hadoop fs -getfacl /file`
* `hadoop fs -getfacl -R /dir`
Exit Code:
Returns 0 on success and non-zero on error.
getfattr
--------
Usage: `hadoop fs -getfattr [-R] -n name | -d [-e en] <path> `
Displays the extended attribute names and values (if any) for a file or directory.
Options:
* -R: Recursively list the attributes for all files and directories.
* -n name: Dump the named extended attribute value.
* -d: Dump all extended attribute values associated with pathname.
* -e *encoding*: Encode values after retrieving them. Valid encodings are "text", "hex", and "base64". Values encoded as text strings are enclosed in double quotes ("), and values encoded as hexadecimal and base64 are prefixed with 0x and 0s, respectively.
* *path*: The file or directory.
Examples:
* `hadoop fs -getfattr -d /file`
* `hadoop fs -getfattr -R -n user.myAttr /dir`
Exit Code:
Returns 0 on success and non-zero on error.
getmerge
--------
Usage: `hadoop fs -getmerge <src> <localdst> [addnl]`
Takes a source directory and a destination file as input and concatenates files in src into the destination local file. Optionally addnl can be set to enable adding a newline character at the end of each file.
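A minimal example (the paths are hypothetical); the optional addnl argument adds a newline after each concatenated file:

```bash
hadoop fs -getmerge /user/hadoop/output /tmp/output.txt addnl
```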
help
----
Usage: `hadoop fs -help`
Return usage output.
ls
----
Usage: `hadoop fs -ls [-d] [-h] [-R] [-t] [-S] [-r] [-u] <args> `
Options:
* -d: Directories are listed as plain files.
* -h: Format file sizes in a human-readable fashion (eg 64.0m instead of 67108864).
* -R: Recursively list subdirectories encountered.
* -t: Sort output by modification time (most recent first).
* -S: Sort output by file size.
* -r: Reverse the sort order.
* -u: Use access time rather than modification time for display and sorting.
For a file ls returns stat on the file with the following format:
permissions number_of_replicas userid groupid filesize modification_date modification_time filename
For a directory it returns list of its direct children as in Unix. A directory is listed as:
permissions userid groupid modification_date modification_time dirname
Files within a directory are ordered by filename by default.
Example:
* `hadoop fs -ls /user/hadoop/file1`
Exit Code:
Returns 0 on success and -1 on error.
lsr
---
Usage: `hadoop fs -lsr <args> `
Recursive version of ls.
**Note:** This command is deprecated. Instead use `hadoop fs -ls -R`
mkdir
-----
Usage: `hadoop fs -mkdir [-p] <paths> `
Takes path URIs as arguments and creates directories.
Options:
* The -p option behavior is much like Unix mkdir -p, creating parent directories along the path.
Example:
* `hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2`
* `hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir hdfs://nn2.example.com/user/hadoop/dir`
Exit Code:
Returns 0 on success and -1 on error.
moveFromLocal
-------------
Usage: `hadoop fs -moveFromLocal <localsrc> <dst> `
Similar to the put command, except that the source localsrc is deleted after it is copied.
moveToLocal
-----------
Usage: `hadoop fs -moveToLocal [-crc] <src> <dst> `
Displays a "Not implemented yet" message.
mv
----
Usage: `hadoop fs -mv URI [URI ...] <dest> `
Moves files from source to destination. This command allows multiple sources as well in which case the destination needs to be a directory. Moving files across file systems is not permitted.
Example:
* `hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2`
* `hadoop fs -mv hdfs://nn.example.com/file1 hdfs://nn.example.com/file2 hdfs://nn.example.com/file3 hdfs://nn.example.com/dir1`
Exit Code:
Returns 0 on success and -1 on error.
put
---
Usage: `hadoop fs -put <localsrc> ... <dst> `
Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system.
* `hadoop fs -put localfile /user/hadoop/hadoopfile`
* `hadoop fs -put localfile1 localfile2 /user/hadoop/hadoopdir`
* `hadoop fs -put localfile hdfs://nn.example.com/hadoop/hadoopfile`
* `hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile` Reads the input from stdin.
Exit Code:
Returns 0 on success and -1 on error.
renameSnapshot
--------------
See [HDFS Snapshots Guide](../hadoop-hdfs/HdfsSnapshots.html).
rm
----
Usage: `hadoop fs -rm [-f] [-r |-R] [-skipTrash] URI [URI ...]`
Delete files specified as args.
Options:
* The -f option will not display a diagnostic message or modify the exit status to reflect an error if the file does not exist.
* The -R option deletes the directory and any content under it recursively.
* The -r option is equivalent to -R.
* The -skipTrash option will bypass trash, if enabled, and delete the specified file(s) immediately. This can be useful when it is necessary to delete files from an over-quota directory.
Example:
* `hadoop fs -rm hdfs://nn.example.com/file /user/hadoop/emptydir`
Exit Code:
Returns 0 on success and -1 on error.
rmdir
-----
Usage: `hadoop fs -rmdir [--ignore-fail-on-non-empty] URI [URI ...]`
Delete a directory.
Options:
* `--ignore-fail-on-non-empty`: When using wildcards, do not fail if a directory still contains files.
Example:
* `hadoop fs -rmdir /user/hadoop/emptydir`
rmr
---
Usage: `hadoop fs -rmr [-skipTrash] URI [URI ...]`
Recursive version of delete.
**Note:** This command is deprecated. Instead use `hadoop fs -rm -r`
setfacl
-------
Usage: `hadoop fs -setfacl [-R] [-b |-k -m |-x <acl_spec> <path>] |[--set <acl_spec> <path>] `
Sets Access Control Lists (ACLs) of files and directories.
Options:
* -b: Remove all but the base ACL entries. The entries for user, group and others are retained for compatibility with permission bits.
* -k: Remove the default ACL.
* -R: Apply operations to all files and directories recursively.
* -m: Modify ACL. New entries are added to the ACL, and existing entries are retained.
* -x: Remove specified ACL entries. Other ACL entries are retained.
* ``--set``: Fully replace the ACL, discarding all existing entries. The *acl\_spec* must include entries for user, group, and others for compatibility with permission bits.
* *acl\_spec*: Comma separated list of ACL entries.
* *path*: File or directory to modify.
Examples:
* `hadoop fs -setfacl -m user:hadoop:rw- /file`
* `hadoop fs -setfacl -x user:hadoop /file`
* `hadoop fs -setfacl -b /file`
* `hadoop fs -setfacl -k /dir`
* `hadoop fs -setfacl --set user::rw-,user:hadoop:rw-,group::r--,other::r-- /file`
* `hadoop fs -setfacl -R -m user:hadoop:r-x /dir`
* `hadoop fs -setfacl -m default:user:hadoop:r-x /dir`
Exit Code:
Returns 0 on success and non-zero on error.
setfattr
--------
Usage: `hadoop fs -setfattr -n name [-v value] | -x name <path> `
Sets an extended attribute name and value for a file or directory.
Options:
* -n name: The extended attribute name.
* -v value: The extended attribute value. There are three different encoding methods for the value. If the argument is enclosed in double quotes, then the value is the string inside the quotes. If the argument is prefixed with 0x or 0X, then it is taken as a hexadecimal number. If the argument begins with 0s or 0S, then it is taken as a base64 encoding.
* -x name: Remove the extended attribute.
* *path*: The file or directory.
Examples:
* `hadoop fs -setfattr -n user.myAttr -v myValue /file`
* `hadoop fs -setfattr -n user.noValue /file`
* `hadoop fs -setfattr -x user.myAttr /file`
Exit Code:
Returns 0 on success and non-zero on error.
setrep
------
Usage: `hadoop fs -setrep [-R] [-w] <numReplicas> <path> `
Changes the replication factor of a file. If *path* is a directory then the command recursively changes the replication factor of all files under the directory tree rooted at *path*.
Options:
* The -w flag requests that the command wait for the replication to complete. This can potentially take a very long time.
* The -R flag is accepted for backwards compatibility. It has no effect.
Example:
* `hadoop fs -setrep -w 3 /user/hadoop/dir1`
Exit Code:
Returns 0 on success and -1 on error.
stat
----
Usage: `hadoop fs -stat [format] <path> ...`
Print statistics about the file/directory at \<path\> in the specified format. Format accepts filesize in blocks (%b), type (%F), group name of owner (%g), name (%n), block size (%o), replication (%r), user name of owner(%u), and modification date (%y, %Y). %y shows UTC date as "yyyy-MM-dd HH:mm:ss" and %Y shows milliseconds since January 1, 1970 UTC. If the format is not specified, %y is used by default.
Example:
* `hadoop fs -stat "%F %u:%g %b %y %n" /file`
Exit Code: Returns 0 on success and -1 on error.
tail
----
Usage: `hadoop fs -tail [-f] URI`
Displays last kilobyte of the file to stdout.
Options:
* The -f option will output appended data as the file grows, as in Unix.
Example:
* `hadoop fs -tail pathname`
Exit Code: Returns 0 on success and -1 on error.
test
----
Usage: `hadoop fs -test -[defsz] URI`
Options:
* -d: if the path is a directory, return 0.
* -e: if the path exists, return 0.
* -f: if the path is a file, return 0.
* -s: if the path is not empty, return 0.
* -z: if the file is zero length, return 0.
Example:
* `hadoop fs -test -e filename`
text
----
Usage: `hadoop fs -text <src> `
Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream.
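For example (the path is hypothetical), decoding a SequenceFile to plain text:

```bash
hadoop fs -text /user/hadoop/data.seq
```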
touchz
------
Usage: `hadoop fs -touchz URI [URI ...]`
Create a file of zero length.
Example:
* `hadoop fs -touchz pathname`
Exit Code: Returns 0 on success and -1 on error.
truncate
--------
Usage: `hadoop fs -truncate [-w] <length> <paths>`
Truncate all files that match the specified file pattern to the
specified length.
Options:
* The `-w` flag requests that the command wait for block recovery
to complete, if necessary. Without the -w flag the file may remain
unclosed for some time while the recovery is in progress.
During this time the file cannot be reopened for append.
Example:
* `hadoop fs -truncate 55 /user/hadoop/file1 /user/hadoop/file2`
* `hadoop fs -truncate -w 127 hdfs://nn1.example.com/user/hadoop/file1`
usage
-----
Usage: `hadoop fs -usage command`
Return the help for an individual command.

@ -0,0 +1,58 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Authentication for Hadoop HTTP web-consoles
===========================================
* [Authentication for Hadoop HTTP web-consoles](#Authentication_for_Hadoop_HTTP_web-consoles)
* [Introduction](#Introduction)
* [Configuration](#Configuration)
Introduction
------------
This document describes how to configure Hadoop HTTP web-consoles to require user authentication.
By default Hadoop HTTP web-consoles (JobTracker, NameNode, TaskTrackers and DataNodes) allow access without any form of authentication.
Similarly to Hadoop RPC, Hadoop HTTP web-consoles can be configured to require Kerberos authentication using HTTP SPNEGO protocol (supported by browsers like Firefox and Internet Explorer).
In addition, Hadoop HTTP web-consoles support the equivalent of Hadoop's Pseudo/Simple authentication. If this option is enabled, users must specify their user name in the first browser interaction using the user.name query string parameter. For example: `http://localhost:50030/jobtracker.jsp?user.name=babu`.
If a custom authentication mechanism is required for the HTTP web-consoles, it is possible to implement a plugin to support the alternate authentication mechanism (refer to Hadoop hadoop-auth for details on writing an `AuthenticatorHandler`).
The next section describes how to configure Hadoop HTTP web-consoles to require user authentication.
Configuration
-------------
The following properties should be in the `core-site.xml` of all the nodes in the cluster.
`hadoop.http.filter.initializers`: add to this property the `org.apache.hadoop.security.AuthenticationFilterInitializer` initializer class.
`hadoop.http.authentication.type`: Defines authentication used for the HTTP web-consoles. The supported values are: `simple` | `kerberos` | `#AUTHENTICATION_HANDLER_CLASSNAME#`. The default value is `simple`.
`hadoop.http.authentication.token.validity`: Indicates how long (in seconds) an authentication token is valid before it has to be renewed. The default value is `36000`.
`hadoop.http.authentication.signature.secret.file`: The signature secret file for signing the authentication tokens. The same secret should be used for all nodes in the cluster, JobTracker, NameNode, DataNode and TaskTracker. The default value is `$user.home/hadoop-http-auth-signature-secret`. IMPORTANT: This file should be readable only by the Unix user running the daemons.
`hadoop.http.authentication.cookie.domain`: The domain to use for the HTTP cookie that stores the authentication token. In order for authentication to work correctly across all nodes in the cluster, the domain must be set correctly. There is no default value; in that case the HTTP cookie will not have a domain and will work only with the hostname issuing the HTTP cookie.
IMPORTANT: when using IP addresses, browsers ignore cookies with domain settings. For this setting to work properly, all nodes in the cluster must be configured to generate URLs with `hostname.domain` names in them.
`hadoop.http.authentication.simple.anonymous.allowed`: Indicates if anonymous requests are allowed when using 'simple' authentication. The default value is `true`.
`hadoop.http.authentication.kerberos.principal`: Indicates the Kerberos principal to be used for the HTTP endpoint when using 'kerberos' authentication. The principal short name must be `HTTP` per the Kerberos HTTP SPNEGO specification. The default value is `HTTP/_HOST@$LOCALHOST`, where `_HOST`, if present, is replaced with the bind address of the HTTP server.
`hadoop.http.authentication.kerberos.keytab`: Location of the keytab file with the credentials for the Kerberos principal used for the HTTP endpoint. The default value is `$user.home/hadoop.keytab`.
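As an illustration (host, port, user name and realm are placeholders), a client can exercise either mode from the command line:

```bash
# Simple/pseudo authentication: supply user.name on the first request
curl "http://namenode.example.com:50070/jmx?user.name=babu"

# Kerberos SPNEGO: obtain a ticket, then let curl negotiate
kinit alice@EXAMPLE.COM
curl --negotiate -u : "http://namenode.example.com:50070/jmx"
```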

@ -0,0 +1,105 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Hadoop Interface Taxonomy: Audience and Stability Classification
================================================================
* [Hadoop Interface Taxonomy: Audience and Stability Classification](#Hadoop_Interface_Taxonomy:_Audience_and_Stability_Classification)
* [Motivation](#Motivation)
* [Interface Classification](#Interface_Classification)
* [Audience](#Audience)
* [Stability](#Stability)
* [How are the Classifications Recorded?](#How_are_the_Classifications_Recorded)
* [FAQ](#FAQ)
Motivation
----------
The interface taxonomy classification provided here is for guidance to developers and users of interfaces. The classification guides a developer to declare the targeted audience or users of an interface and also its stability.
* Benefits to the user of an interface: Knows which interfaces to use or not use and their stability.
* Benefits to the developer: prevents accidental changes to interfaces and hence accidental impact on users, other components, or systems. This is particularly useful in large systems with many developers who may not all have a shared state/history of the project.
Interface Classification
------------------------
Hadoop adopts the following interface classification; it was derived from the [OpenSolaris taxonomy](http://www.opensolaris.org/os/community/arc/policies/interface-taxonomy/#Advice) and, to some extent, from the taxonomy used inside Yahoo. Interfaces have two main attributes: Audience and Stability.
### Audience
Audience denotes the potential consumers of the interface. While many interfaces are internal/private to the implementation, others are public/external interfaces meant for wider consumption by applications and/or clients. For example, in POSIX, libc is an external or public interface, while large parts of the kernel are internal or private interfaces. Also, some interfaces are targeted towards other specific subsystems.
Identifying the audience of an interface helps define the impact of breaking it. For instance, it might be okay to break the compatibility of an interface whose audience is a small number of specific subsystems. On the other hand, it is probably not okay to break a protocol interface that millions of Internet users depend on.
Hadoop uses the following kinds of audience in order of increasing/wider visibility:
* Private:
* The interface is for internal use within the project (such as HDFS or MapReduce) and should not be used by applications or by other projects. It is subject to change at any time without notice. Most interfaces of a project are Private (also referred to as project-private).
* Limited-Private:
* The interface is used by a specified set of projects or systems (typically closely related projects). Other projects or systems should not use the interface. Changes to the interface will be communicated/negotiated with the specified projects. For example, in the Hadoop project, some interfaces are LimitedPrivate{HDFS, MapReduce} in that they are private to the HDFS and MapReduce projects.
* Public:
* The interface is for general use by any application.
Hadoop doesn't have a Company-Private classification, which is meant for APIs which are intended to be used by other projects within the company, since it doesn't apply to open source projects. Also, certain APIs are annotated as @VisibleForTesting (from com.google.common.annotations.VisibleForTesting) - these are meant to be used strictly for unit tests and should be treated as "Private" APIs.
### Stability
Stability denotes how stable an interface is, as in when incompatible changes to the interface are allowed. Hadoop APIs have the following levels of stability.
* Stable
* Can evolve while retaining compatibility for minor release boundaries; in other words, incompatible changes to APIs marked Stable are allowed only at major releases (i.e. at m.0).
* Evolving
* Evolving, but incompatible changes are allowed at minor releases (i.e. m.x)
* Unstable
* Incompatible changes to Unstable APIs are allowed any time. This usually makes sense for only private interfaces.
* However one may call this out for a supposedly public interface to highlight that it should not be used as an interface; for public interfaces, labeling it as Not-an-interface is probably more appropriate than "Unstable".
* Examples of publicly visible interfaces that are unstable (i.e. not-an-interface): GUI, CLIs whose output format will change
* Deprecated
* APIs that could potentially be removed in the future and should not be used.
How are the Classifications Recorded?
-------------------------------------
How will the classification be recorded for Hadoop APIs?
* Each interface or class will have the audience and stability recorded using annotations in org.apache.hadoop.classification package.
* The javadoc generated by the maven target javadoc:javadoc lists only the public API.
* One can derive the audience of Java classes and Java interfaces from the audience of the package in which they are contained. Hence it is useful to declare the audience of each Java package as public or private (along with the private audience variations).
FAQ
---
* Why aren't the Java scopes (private, package private and public) good enough?
* Java's scoping is not very complete. One is often forced to make a class public in order for other internal components to use it. It does not have friends or sub-package-private like C++.
* But I can easily access a private implementation interface if it is Java public. Where is the protection and control?
* The purpose of this is not providing absolute access control. Its purpose is to communicate to users and developers. One can access private implementation functions in libc; however if they change the internal implementation details, your application will break and you will have little sympathy from the folks who are supplying libc. If you use a non-public interface you understand the risks.
* Why bother declaring the stability of a private interface? Aren't private interfaces always unstable?
* Private interfaces are not always unstable. In the cases where they are stable they capture internal properties of the system and can communicate these properties to its internal users and to developers of the interface.
* e.g. In HDFS, NN-DN protocol is private but stable and can help implement rolling upgrades. It communicates that this interface should not be changed in incompatible ways even though it is private.
* e.g. In HDFS, FSImage stability can help provide more flexible roll backs.
* What is the harm in applications using a private interface that is stable? How is it different than a public stable interface?
* While a private interface marked as stable is targeted to change only at major releases, it may break at other times if the providers of that interface are willing to change the internal users of that interface. Further, a public stable interface is less likely to break even at major releases (even though it is allowed to break compatibility) because the impact of the change is larger. If you use a private interface (regardless of its stability) you run the risk of incompatibility.
* Why bother with Limited-private? Isn't it giving special treatment to some projects? That is not fair.
* First, most interfaces should be public or private; actually let us state it even stronger: make it private unless you really want to expose it to the public for general use.
* Limited-private is for interfaces that are not intended for general use. They are exposed to related projects that need special hooks. Such a classification has a cost to both the supplier and consumer of the limited interface. Both will have to work together if ever there is a need to break the interface in the future; for example the supplier and the consumers will have to work together to get coordinated releases of their respective projects. This should not be taken lightly: if you can get away with private then do so; if the interface is really for general use by all applications then make it public. But remember that making an interface public carries a huge responsibility. Sometimes Limited-private is just right.
* A good example of a limited-private interface is BlockLocations. This is a fairly low-level interface that we are willing to expose to MR and perhaps HBase. We are likely to change it down the road, and at that time we will have to get a coordinated effort with the MR team to release matching releases. While MR and HDFS are always released in sync today, they may change down the road.
* If you have a limited-private interface with many projects listed then you are fooling yourself. It is practically public.
* It might be worth declaring a special audience classification called Hadoop-Private for the Hadoop family.
* Let's treat all private interfaces as Hadoop-private. What is the harm in projects in the Hadoop family having access to private classes?
* Do we want MR accessing class files that are implementation details inside HDFS? There used to be many such layer violations in the code, which we have been cleaning up over the last few years. We don't want such layer violations to creep back in by not separating between the major components like HDFS and MR.
* Aren't all public interfaces stable?
* One may mark a public interface as evolving in its early days. Here one is promising to make an effort to make compatible changes but may need to break it at minor releases.
* One example of a public interface that is unstable is where one is providing an implementation of a standards-body based interface that is still under development. For example, many companies, in an attempt to be first to market, have provided implementations of a new NFS protocol even when the protocol was not fully completed by IETF. The implementor cannot evolve the interface in a fashion that causes the least disruption because the stability is controlled by the standards body. Hence it is appropriate to label the interface as unstable.

@ -0,0 +1,456 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
* [Overview](#Overview)
* [jvm context](#jvm_context)
* [JvmMetrics](#JvmMetrics)
* [rpc context](#rpc_context)
* [rpc](#rpc)
* [RetryCache/NameNodeRetryCache](#RetryCacheNameNodeRetryCache)
* [rpcdetailed context](#rpcdetailed_context)
* [rpcdetailed](#rpcdetailed)
* [dfs context](#dfs_context)
* [namenode](#namenode)
* [FSNamesystem](#FSNamesystem)
* [JournalNode](#JournalNode)
* [datanode](#datanode)
* [yarn context](#yarn_context)
* [ClusterMetrics](#ClusterMetrics)
* [QueueMetrics](#QueueMetrics)
* [NodeManagerMetrics](#NodeManagerMetrics)
* [ugi context](#ugi_context)
* [UgiMetrics](#UgiMetrics)
* [metricssystem context](#metricssystem_context)
* [MetricsSystem](#MetricsSystem)
* [default context](#default_context)
* [StartupProgress](#StartupProgress)
Overview
========
Metrics are statistical information exposed by Hadoop daemons, used for monitoring, performance tuning and debugging. There are many metrics available by default, and they are very useful for troubleshooting. This page shows the details of the available metrics.
Each section describes each context into which metrics are grouped.
The documentation of Metrics 2.0 framework is [here](../../api/org/apache/hadoop/metrics2/package-summary.html).
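These metrics can also be fetched over HTTP from each daemon's `/jmx` endpoint as JSON; a quick sketch (host and port are placeholders):

```bash
# Dump all NameNode metrics
curl "http://namenode.example.com:50070/jmx"

# Restrict the output to a single MBean, e.g. the JvmMetrics record
curl "http://namenode.example.com:50070/jmx?qry=Hadoop:service=NameNode,name=JvmMetrics"
```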
jvm context
===========
JvmMetrics
----------
Each metrics record contains tags such as ProcessName, SessionID and Hostname as additional information along with metrics.
| Name | Description |
|:---- |:---- |
| `MemNonHeapUsedM` | Current non-heap memory used in MB |
| `MemNonHeapCommittedM` | Current non-heap memory committed in MB |
| `MemNonHeapMaxM` | Max non-heap memory size in MB |
| `MemHeapUsedM` | Current heap memory used in MB |
| `MemHeapCommittedM` | Current heap memory committed in MB |
| `MemHeapMaxM` | Max heap memory size in MB |
| `MemMaxM` | Max memory size in MB |
| `ThreadsNew` | Current number of NEW threads |
| `ThreadsRunnable` | Current number of RUNNABLE threads |
| `ThreadsBlocked` | Current number of BLOCKED threads |
| `ThreadsWaiting` | Current number of WAITING threads |
| `ThreadsTimedWaiting` | Current number of TIMED\_WAITING threads |
| `ThreadsTerminated` | Current number of TERMINATED threads |
| `GcInfo` | Total GC count and GC time in msec, grouped by the kind of GC, e.g. GcCountPS Scavenge=6, GCTimeMillisPS Scavenge=40, GCCountPS MarkSweep=0, GCTimeMillisPS MarkSweep=0 |
| `GcCount` | Total GC count |
| `GcTimeMillis` | Total GC time in msec |
| `LogFatal` | Total number of FATAL logs |
| `LogError` | Total number of ERROR logs |
| `LogWarn` | Total number of WARN logs |
| `LogInfo` | Total number of INFO logs |
| `GcNumWarnThresholdExceeded` | Number of times that the GC warn threshold is exceeded |
| `GcNumInfoThresholdExceeded` | Number of times that the GC info threshold is exceeded |
| `GcTotalExtraSleepTime` | Total GC extra sleep time in msec |
rpc context
===========
rpc
---
Each metrics record contains tags such as Hostname and port (number to which server is bound) as additional information along with metrics.
| Name | Description |
|:---- |:---- |
| `ReceivedBytes` | Total number of received bytes |
| `SentBytes` | Total number of sent bytes |
| `RpcQueueTimeNumOps` | Total number of RPC calls |
| `RpcQueueTimeAvgTime` | Average queue time in milliseconds |
| `RpcProcessingTimeNumOps` | Total number of RPC calls (same as RpcQueueTimeNumOps) |
| `RpcProcessingAvgTime` | Average processing time in milliseconds |
| `RpcAuthenticationFailures` | Total number of authentication failures |
| `RpcAuthenticationSuccesses` | Total number of authentication successes |
| `RpcAuthorizationFailures` | Total number of authorization failures |
| `RpcAuthorizationSuccesses` | Total number of authorization successes |
| `NumOpenConnections` | Current number of open connections |
| `CallQueueLength` | Current length of the call queue |
| `rpcQueueTime`*num*`sNumOps` | Shows total number of RPC calls (*num* seconds granularity) if `rpc.metrics.quantile.enable` is set to true. *num* is specified by `rpc.metrics.percentiles.intervals`. |
| `rpcQueueTime`*num*`s50thPercentileLatency` | Shows the 50th percentile of RPC queue time in milliseconds (*num* seconds granularity) if `rpc.metrics.quantile.enable` is set to true. *num* is specified by `rpc.metrics.percentiles.intervals`. |
| `rpcQueueTime`*num*`s75thPercentileLatency` | Shows the 75th percentile of RPC queue time in milliseconds (*num* seconds granularity) if `rpc.metrics.quantile.enable` is set to true. *num* is specified by `rpc.metrics.percentiles.intervals`. |
| `rpcQueueTime`*num*`s90thPercentileLatency` | Shows the 90th percentile of RPC queue time in milliseconds (*num* seconds granularity) if `rpc.metrics.quantile.enable` is set to true. *num* is specified by `rpc.metrics.percentiles.intervals`. |
| `rpcQueueTime`*num*`s95thPercentileLatency` | Shows the 95th percentile of RPC queue time in milliseconds (*num* seconds granularity) if `rpc.metrics.quantile.enable` is set to true. *num* is specified by `rpc.metrics.percentiles.intervals`. |
| `rpcQueueTime`*num*`s99thPercentileLatency` | Shows the 99th percentile of RPC queue time in milliseconds (*num* seconds granularity) if `rpc.metrics.quantile.enable` is set to true. *num* is specified by `rpc.metrics.percentiles.intervals`. |
| `rpcProcessingTime`*num*`sNumOps` | Shows total number of RPC calls (*num* seconds granularity) if `rpc.metrics.quantile.enable` is set to true. *num* is specified by `rpc.metrics.percentiles.intervals`. |
| `rpcProcessingTime`*num*`s50thPercentileLatency` | Shows the 50th percentile of RPC processing time in milliseconds (*num* seconds granularity) if `rpc.metrics.quantile.enable` is set to true. *num* is specified by `rpc.metrics.percentiles.intervals`. |
| `rpcProcessingTime`*num*`s75thPercentileLatency` | Shows the 75th percentile of RPC processing time in milliseconds (*num* seconds granularity) if `rpc.metrics.quantile.enable` is set to true. *num* is specified by `rpc.metrics.percentiles.intervals`. |
| `rpcProcessingTime`*num*`s90thPercentileLatency` | Shows the 90th percentile of RPC processing time in milliseconds (*num* seconds granularity) if `rpc.metrics.quantile.enable` is set to true. *num* is specified by `rpc.metrics.percentiles.intervals`. |
| `rpcProcessingTime`*num*`s95thPercentileLatency` | Shows the 95th percentile of RPC processing time in milliseconds (*num* seconds granularity) if `rpc.metrics.quantile.enable` is set to true. *num* is specified by `rpc.metrics.percentiles.intervals`. |
| `rpcProcessingTime`*num*`s99thPercentileLatency` | Shows the 99th percentile of RPC processing time in milliseconds (*num* seconds granularity) if `rpc.metrics.quantile.enable` is set to true. *num* is specified by `rpc.metrics.percentiles.intervals`. |
RetryCache/NameNodeRetryCache
-----------------------------
RetryCache metrics are useful for monitoring NameNode fail-over. Each metrics record contains the Hostname tag.
| Name | Description |
|:---- |:---- |
| `CacheHit` | Total number of RetryCache hits |
| `CacheCleared` | Total number of RetryCache entries cleared |
| `CacheUpdated` | Total number of RetryCache updates |
rpcdetailed context
===================
Metrics of the rpcdetailed context are exposed by the RPC layer in a unified manner. Two metrics are exposed for each RPC based on its name. Metrics named "(RPC method name)NumOps" indicate the total number of method calls, and metrics named "(RPC method name)AvgTime" show the average turnaround time for method calls in milliseconds.
rpcdetailed
-----------
Each metrics record contains tags such as Hostname and port (number to which server is bound) as additional information along with metrics.
Metrics for RPC methods that have never been called are not included in the metrics record.
| Name | Description |
|:---- |:---- |
| *methodname*`NumOps` | Total number of times the method is called |
| *methodname*`AvgTime` | Average turnaround time of the method in milliseconds |
dfs context
===========
namenode
--------
Each metrics record contains tags such as ProcessName, SessionId, and Hostname as additional information along with metrics.
| Name | Description |
|:---- |:---- |
| `CreateFileOps` | Total number of files created |
| `FilesCreated` | Total number of files and directories created by create or mkdir operations |
| `FilesAppended` | Total number of files appended |
| `GetBlockLocations` | Total number of getBlockLocations operations |
| `FilesRenamed` | Total number of rename **operations** (NOT number of files/dirs renamed) |
| `GetListingOps` | Total number of directory listing operations |
| `DeleteFileOps` | Total number of delete operations |
| `FilesDeleted` | Total number of files and directories deleted by delete or rename operations |
| `FileInfoOps` | Total number of getFileInfo and getLinkFileInfo operations |
| `AddBlockOps` | Total number of successful addBlock operations |
| `GetAdditionalDatanodeOps` | Total number of getAdditionalDatanode operations |
| `CreateSymlinkOps` | Total number of createSymlink operations |
| `GetLinkTargetOps` | Total number of getLinkTarget operations |
| `FilesInGetListingOps` | Total number of files and directories listed by directory listing operations |
| `AllowSnapshotOps` | Total number of allowSnapshot operations |
| `DisallowSnapshotOps` | Total number of disallowSnapshot operations |
| `CreateSnapshotOps` | Total number of createSnapshot operations |
| `DeleteSnapshotOps` | Total number of deleteSnapshot operations |
| `RenameSnapshotOps` | Total number of renameSnapshot operations |
| `ListSnapshottableDirOps` | Total number of snapshottableDirectoryStatus operations |
| `SnapshotDiffReportOps` | Total number of getSnapshotDiffReport operations |
| `TransactionsNumOps` | Total number of Journal transactions |
| `TransactionsAvgTime` | Average time of Journal transactions in milliseconds |
| `SyncsNumOps` | Total number of Journal syncs |
| `SyncsAvgTime` | Average time of Journal syncs in milliseconds |
| `TransactionsBatchedInSync` | Total number of Journal transactions batched in sync |
| `BlockReportNumOps` | Total number of block reports processed from DataNodes |
| `BlockReportAvgTime` | Average time of processing block reports in milliseconds |
| `CacheReportNumOps` | Total number of cache reports processed from DataNodes |
| `CacheReportAvgTime` | Average time of processing cache reports in milliseconds |
| `SafeModeTime` | The interval in milliseconds between FSNamesystem start and the last time safe mode was left  (sometimes not equal to the time spent in safe mode, see [HDFS-5156](https://issues.apache.org/jira/browse/HDFS-5156)) |
| `FsImageLoadTime` | Time loading FS Image at startup in milliseconds |
| `GetEditNumOps` | Total number of edits downloads from SecondaryNameNode |
| `GetEditAvgTime` | Average edits download time in milliseconds |
| `GetImageNumOps` | Total number of fsimage downloads from SecondaryNameNode |
| `GetImageAvgTime` | Average fsimage download time in milliseconds |
| `PutImageNumOps` | Total number of fsimage uploads to SecondaryNameNode |
| `PutImageAvgTime` | Average fsimage upload time in milliseconds |
FSNamesystem
------------
Each metrics record contains tags such as HAState and Hostname as additional information along with metrics.
| Name | Description |
|:---- |:---- |
| `MissingBlocks` | Current number of missing blocks |
| `ExpiredHeartbeats` | Total number of expired heartbeats |
| `TransactionsSinceLastCheckpoint` | Total number of transactions since last checkpoint |
| `TransactionsSinceLastLogRoll` | Total number of transactions since last edit log roll |
| `LastWrittenTransactionId` | Last transaction ID written to the edit log |
| `LastCheckpointTime` | Time in milliseconds since epoch of last checkpoint |
| `CapacityTotal` | Current raw capacity of DataNodes in bytes |
| `CapacityTotalGB` | Current raw capacity of DataNodes in GB |
| `CapacityUsed` | Current used capacity across all DataNodes in bytes |
| `CapacityUsedGB` | Current used capacity across all DataNodes in GB |
| `CapacityRemaining` | Current remaining capacity in bytes |
| `CapacityRemainingGB` | Current remaining capacity in GB |
| `CapacityUsedNonDFS` | Current space used by DataNodes for non DFS purposes in bytes |
| `TotalLoad` | Current number of connections |
| `SnapshottableDirectories` | Current number of snapshottable directories |
| `Snapshots` | Current number of snapshots |
| `BlocksTotal` | Current number of allocated blocks in the system |
| `FilesTotal` | Current number of files and directories |
| `PendingReplicationBlocks` | Current number of blocks pending to be replicated |
| `UnderReplicatedBlocks` | Current number of blocks under replicated |
| `CorruptBlocks` | Current number of blocks with corrupt replicas. |
| `ScheduledReplicationBlocks` | Current number of blocks scheduled for replications |
| `PendingDeletionBlocks` | Current number of blocks pending deletion |
| `ExcessBlocks` | Current number of excess blocks |
| `PostponedMisreplicatedBlocks` | (HA-only) Current number of blocks postponed to replicate |
| `PendingDataNodeMessageCount` | (HA-only) Current number of pending block-related messages for later processing in the standby NameNode |
| `MillisSinceLastLoadedEdits` | (HA-only) Time in milliseconds since the standby NameNode last loaded the edit log. Set to 0 on the active NameNode |
| `BlockCapacity` | Current block capacity |
| `StaleDataNodes` | Current number of DataNodes marked stale due to delayed heartbeat |
| `TotalFiles` | Current number of files and directories (same as FilesTotal) |
JournalNode
-----------
The server-side metrics for a journal from the JournalNode's perspective. Each metrics record contains Hostname tag as additional information along with metrics.
| Name | Description |
|:---- |:---- |
| `Syncs60sNumOps` | Number of sync operations (1 minute granularity) |
| `Syncs60s50thPercentileLatencyMicros` | The 50th percentile of sync latency in microseconds (1 minute granularity) |
| `Syncs60s75thPercentileLatencyMicros` | The 75th percentile of sync latency in microseconds (1 minute granularity) |
| `Syncs60s90thPercentileLatencyMicros` | The 90th percentile of sync latency in microseconds (1 minute granularity) |
| `Syncs60s95thPercentileLatencyMicros` | The 95th percentile of sync latency in microseconds (1 minute granularity) |
| `Syncs60s99thPercentileLatencyMicros` | The 99th percentile of sync latency in microseconds (1 minute granularity) |
| `Syncs300sNumOps` | Number of sync operations (5 minutes granularity) |
| `Syncs300s50thPercentileLatencyMicros` | The 50th percentile of sync latency in microseconds (5 minutes granularity) |
| `Syncs300s75thPercentileLatencyMicros` | The 75th percentile of sync latency in microseconds (5 minutes granularity) |
| `Syncs300s90thPercentileLatencyMicros` | The 90th percentile of sync latency in microseconds (5 minutes granularity) |
| `Syncs300s95thPercentileLatencyMicros` | The 95th percentile of sync latency in microseconds (5 minutes granularity) |
| `Syncs300s99thPercentileLatencyMicros` | The 99th percentile of sync latency in microseconds (5 minutes granularity) |
| `Syncs3600sNumOps` | Number of sync operations (1 hour granularity) |
| `Syncs3600s50thPercentileLatencyMicros` | The 50th percentile of sync latency in microseconds (1 hour granularity) |
| `Syncs3600s75thPercentileLatencyMicros` | The 75th percentile of sync latency in microseconds (1 hour granularity) |
| `Syncs3600s90thPercentileLatencyMicros` | The 90th percentile of sync latency in microseconds (1 hour granularity) |
| `Syncs3600s95thPercentileLatencyMicros` | The 95th percentile of sync latency in microseconds (1 hour granularity) |
| `Syncs3600s99thPercentileLatencyMicros` | The 99th percentile of sync latency in microseconds (1 hour granularity) |
| `BatchesWritten` | Total number of batches written since startup |
| `TxnsWritten` | Total number of transactions written since startup |
| `BytesWritten` | Total number of bytes written since startup |
| `BatchesWrittenWhileLagging` | Total number of batches written where this node was lagging |
| `LastWriterEpoch` | Current writer's epoch number |
| `CurrentLagTxns` | The number of transactions that this JournalNode is lagging |
| `LastWrittenTxId` | The highest transaction id stored on this JournalNode |
| `LastPromisedEpoch` | The last epoch number which this node has promised not to accept any lower epoch, or 0 if no promises have been made |
datanode
--------
Each metrics record contains tags such as SessionId and Hostname as additional information along with metrics.
| Name | Description |
|:---- |:---- |
| `BytesWritten` | Total number of bytes written to DataNode |
| `BytesRead` | Total number of bytes read from DataNode |
| `BlocksWritten` | Total number of blocks written to DataNode |
| `BlocksRead` | Total number of blocks read from DataNode |
| `BlocksReplicated` | Total number of blocks replicated |
| `BlocksRemoved` | Total number of blocks removed |
| `BlocksVerified` | Total number of blocks verified |
| `BlockVerificationFailures` | Total number of verification failures |
| `BlocksCached` | Total number of blocks cached |
| `BlocksUncached` | Total number of blocks uncached |
| `ReadsFromLocalClient` | Total number of read operations from local client |
| `ReadsFromRemoteClient` | Total number of read operations from remote client |
| `WritesFromLocalClient` | Total number of write operations from local client |
| `WritesFromRemoteClient` | Total number of write operations from remote client |
| `BlocksGetLocalPathInfo` | Total number of operations to get local path names of blocks |
| `FsyncCount` | Total number of fsync |
| `VolumeFailures` | Total number of volume failures occurred |
| `ReadBlockOpNumOps` | Total number of read operations |
| `ReadBlockOpAvgTime` | Average time of read operations in milliseconds |
| `WriteBlockOpNumOps` | Total number of write operations |
| `WriteBlockOpAvgTime` | Average time of write operations in milliseconds |
| `BlockChecksumOpNumOps` | Total number of blockChecksum operations |
| `BlockChecksumOpAvgTime` | Average time of blockChecksum operations in milliseconds |
| `CopyBlockOpNumOps` | Total number of block copy operations |
| `CopyBlockOpAvgTime` | Average time of block copy operations in milliseconds |
| `ReplaceBlockOpNumOps` | Total number of block replace operations |
| `ReplaceBlockOpAvgTime` | Average time of block replace operations in milliseconds |
| `HeartbeatsNumOps` | Total number of heartbeats |
| `HeartbeatsAvgTime` | Average heartbeat time in milliseconds |
| `BlockReportsNumOps` | Total number of block report operations |
| `BlockReportsAvgTime` | Average time of block report operations in milliseconds |
| `CacheReportsNumOps` | Total number of cache report operations |
| `CacheReportsAvgTime` | Average time of cache report operations in milliseconds |
| `PacketAckRoundTripTimeNanosNumOps` | Total number of ack round trip |
| `PacketAckRoundTripTimeNanosAvgTime` | Average time from ack send to receive minus the downstream ack time in nanoseconds |
| `FlushNanosNumOps` | Total number of flushes |
| `FlushNanosAvgTime` | Average flush time in nanoseconds |
| `FsyncNanosNumOps` | Total number of fsync |
| `FsyncNanosAvgTime` | Average fsync time in nanoseconds |
| `SendDataPacketBlockedOnNetworkNanosNumOps` | Total number of sending packets |
| `SendDataPacketBlockedOnNetworkNanosAvgTime` | Average waiting time of sending packets in nanoseconds |
| `SendDataPacketTransferNanosNumOps` | Total number of sending packets |
| `SendDataPacketTransferNanosAvgTime` | Average transfer time of sending packets in nanoseconds |
yarn context
============
ClusterMetrics
--------------
ClusterMetrics shows the metrics of the YARN cluster from the ResourceManager's perspective. Each metrics record contains Hostname tag as additional information along with metrics.
| Name | Description |
|:---- |:---- |
| `NumActiveNMs` | Current number of active NodeManagers |
| `NumDecommissionedNMs` | Current number of decommissioned NodeManagers |
| `NumLostNMs` | Current number of lost NodeManagers for not sending heartbeats |
| `NumUnhealthyNMs` | Current number of unhealthy NodeManagers |
| `NumRebootedNMs` | Current number of rebooted NodeManagers |
QueueMetrics
------------
QueueMetrics shows an application queue from the ResourceManager's perspective. Each metrics record shows the statistics of each queue, and contains tags such as queue name and Hostname as additional information along with metrics.
In `running_`*num* metrics such as `running_0`, you can set the property `yarn.resourcemanager.metrics.runtime.buckets` in yarn-site.xml to change the buckets. The default value is `60,300,1440`.
| Name | Description |
|:---- |:---- |
| `running_0` | Current number of running applications whose elapsed time is less than 60 minutes |
| `running_60` | Current number of running applications whose elapsed time is between 60 and 300 minutes |
| `running_300` | Current number of running applications whose elapsed time is between 300 and 1440 minutes |
| `running_1440` | Current number of running applications whose elapsed time is more than 1440 minutes |
| `AppsSubmitted` | Total number of submitted applications |
| `AppsRunning` | Current number of running applications |
| `AppsPending` | Current number of applications that have not yet been assigned any containers |
| `AppsCompleted` | Total number of completed applications |
| `AppsKilled` | Total number of killed applications |
| `AppsFailed` | Total number of failed applications |
| `AllocatedMB` | Current allocated memory in MB |
| `AllocatedVCores` | Current allocated CPU in virtual cores |
| `AllocatedContainers` | Current number of allocated containers |
| `AggregateContainersAllocated` | Total number of allocated containers |
| `AggregateContainersReleased` | Total number of released containers |
| `AvailableMB` | Current available memory in MB |
| `AvailableVCores` | Current available CPU in virtual cores |
| `PendingMB` | Current pending memory resource requests in MB that are not yet fulfilled by the scheduler |
| `PendingVCores` | Current pending CPU allocation requests in virtual cores that are not yet fulfilled by the scheduler |
| `PendingContainers` | Current pending resource requests that are not yet fulfilled by the scheduler |
| `ReservedMB` | Current reserved memory in MB |
| `ReservedVCores` | Current reserved CPU in virtual cores |
| `ReservedContainers` | Current number of reserved containers |
| `ActiveUsers` | Current number of active users |
| `ActiveApplications` | Current number of active applications |
| `FairShareMB` | (FairScheduler only) Current fair share of memory in MB |
| `FairShareVCores` | (FairScheduler only) Current fair share of CPU in virtual cores |
| `MinShareMB` | (FairScheduler only) Minimum share of memory in MB |
| `MinShareVCores` | (FairScheduler only) Minimum share of CPU in virtual cores |
| `MaxShareMB` | (FairScheduler only) Maximum share of memory in MB |
| `MaxShareVCores` | (FairScheduler only) Maximum share of CPU in virtual cores |
NodeManagerMetrics
------------------
NodeManagerMetrics shows the statistics of the containers in the node. Each metrics record contains Hostname tag as additional information along with metrics.
| Name | Description |
|:---- |:---- |
| `containersLaunched` | Total number of launched containers |
| `containersCompleted` | Total number of successfully completed containers |
| `containersFailed` | Total number of failed containers |
| `containersKilled` | Total number of killed containers |
| `containersIniting` | Current number of initializing containers |
| `containersRunning` | Current number of running containers |
| `allocatedContainers` | Current number of allocated containers |
| `allocatedGB` | Current allocated memory in GB |
| `availableGB` | Current available memory in GB |
ugi context
===========
UgiMetrics
----------
UgiMetrics is related to user and group information. Each metrics record contains Hostname tag as additional information along with metrics.
| Name | Description |
|:---- |:---- |
| `LoginSuccessNumOps` | Total number of successful Kerberos logins |
| `LoginSuccessAvgTime` | Average time for successful Kerberos logins in milliseconds |
| `LoginFailureNumOps` | Total number of failed Kerberos logins |
| `LoginFailureAvgTime` | Average time for failed Kerberos logins in milliseconds |
| `getGroupsNumOps` | Total number of group resolutions |
| `getGroupsAvgTime` | Average time for group resolution in milliseconds |
| `getGroups`*num*`sNumOps` | Total number of group resolutions (*num* seconds granularity). *num* is specified by `hadoop.user.group.metrics.percentiles.intervals`. |
| `getGroups`*num*`s50thPercentileLatency` | Shows the 50th percentile of group resolution time in milliseconds (*num* seconds granularity). *num* is specified by `hadoop.user.group.metrics.percentiles.intervals`. |
| `getGroups`*num*`s75thPercentileLatency` | Shows the 75th percentile of group resolution time in milliseconds (*num* seconds granularity). *num* is specified by `hadoop.user.group.metrics.percentiles.intervals`. |
| `getGroups`*num*`s90thPercentileLatency` | Shows the 90th percentile of group resolution time in milliseconds (*num* seconds granularity). *num* is specified by `hadoop.user.group.metrics.percentiles.intervals`. |
| `getGroups`*num*`s95thPercentileLatency` | Shows the 95th percentile of group resolution time in milliseconds (*num* seconds granularity). *num* is specified by `hadoop.user.group.metrics.percentiles.intervals`. |
| `getGroups`*num*`s99thPercentileLatency` | Shows the 99th percentile of group resolution time in milliseconds (*num* seconds granularity). *num* is specified by `hadoop.user.group.metrics.percentiles.intervals`. |
metricssystem context
=====================
MetricsSystem
-------------
MetricsSystem shows the statistics for metrics snapshot and publish operations. Each metrics record contains Hostname tag as additional information along with metrics.
| Name | Description |
|:---- |:---- |
| `NumActiveSources` | Current number of active metrics sources |
| `NumAllSources` | Total number of metrics sources |
| `NumActiveSinks` | Current number of active sinks |
| `NumAllSinks` | Total number of sinks  (BUT usually less than `NumActiveSinks`, see [HADOOP-9946](https://issues.apache.org/jira/browse/HADOOP-9946)) |
| `SnapshotNumOps` | Total number of operations to snapshot statistics from a metrics source |
| `SnapshotAvgTime` | Average time in milliseconds to snapshot statistics from a metrics source |
| `PublishNumOps` | Total number of operations to publish statistics to a sink |
| `PublishAvgTime` | Average time in milliseconds to publish statistics to a sink |
| `DroppedPubAll` | Total number of dropped publishes |
| `Sink_`*instance*`NumOps` | Total number of sink operations for the *instance* |
| `Sink_`*instance*`AvgTime` | Average time in milliseconds of sink operations for the *instance* |
| `Sink_`*instance*`Dropped` | Total number of dropped sink operations for the *instance* |
| `Sink_`*instance*`Qsize` | Current queue length of sink operations  (BUT always set to 0 because nothing to increment this metrics, see [HADOOP-9941](https://issues.apache.org/jira/browse/HADOOP-9941)) |
default context
===============
StartupProgress
---------------
StartupProgress metrics show the statistics of NameNode startup. Four metrics are exposed for each startup phase based on its name. The startup *phase*s are `LoadingFsImage`, `LoadingEdits`, `SavingCheckpoint`, and `SafeMode`. Each metrics record contains Hostname tag as additional information along with metrics.
| Name | Description |
|:---- |:---- |
| `ElapsedTime` | Total elapsed time in milliseconds |
| `PercentComplete` | Current rate completed in NameNode startup progress  (The max value is not 100 but 1.0) |
| *phase*`Count` | Total number of steps completed in the phase |
| *phase*`ElapsedTime` | Total elapsed time in the phase in milliseconds |
| *phase*`Total` | Total number of steps in the phase |
| *phase*`PercentComplete` | Current rate completed in the phase  (The max value is not 100 but 1.0) |
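All of the contexts above are collected and published by the Metrics 2.0 framework, which is configured through `hadoop-metrics2.properties`. As a hedged sketch (the `namenode` sink prefix, output file name, and configuration path are illustrative assumptions, not required values), the following enables the bundled `FileSink` so NameNode metrics are written to a local file every 10 seconds:

```bash
# Append a minimal FileSink configuration; adjust the path and period for
# your deployment, and avoid duplicating keys that already exist in the file.
cat >> etc/hadoop/hadoop-metrics2.properties <<'EOF'
namenode.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
namenode.sink.file.filename=namenode-metrics.out
*.period=10
EOF
```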


@ -0,0 +1,145 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Native Libraries Guide
======================
* [Native Libraries Guide](#Native_Libraries_Guide)
* [Overview](#Overview)
* [Native Hadoop Library](#Native_Hadoop_Library)
* [Usage](#Usage)
* [Components](#Components)
* [Supported Platforms](#Supported_Platforms)
* [Download](#Download)
* [Build](#Build)
* [Runtime](#Runtime)
* [Check](#Check)
* [Native Shared Libraries](#Native_Shared_Libraries)
Overview
--------
This guide describes the native hadoop library and includes a small discussion about native shared libraries.
Note: Depending on your environment, the term "native libraries" could refer to all \*.so's you need to compile; and, the term "native compression" could refer to all \*.so's you need to compile that are specifically related to compression. Currently, however, this document only addresses the native hadoop library (`libhadoop.so`). The document for libhdfs library (`libhdfs.so`) is [here](../hadoop-hdfs/LibHdfs.html).
Native Hadoop Library
---------------------
Hadoop has native implementations of certain components for performance reasons and because Java implementations are not available. These components are available in a single, dynamically-linked native library called the native hadoop library. On the \*nix platforms the library is named `libhadoop.so`.
Usage
-----
It is fairly easy to use the native hadoop library:
1. Review the components.
2. Review the supported platforms.
3. Either download a hadoop release, which will include a pre-built version of the native hadoop library, or build your own version of the native hadoop library. Whether you download or build, the name for the library is the same: libhadoop.so
4. Install the compression codec development packages (\>zlib-1.2, \>gzip-1.2):
* If you download the library, install one or more development packages - whichever compression codecs you want to use with your deployment.
* If you build the library, it is mandatory to install both development packages.
5. Check the runtime log files.
Components
----------
The native hadoop library includes various components:
* Compression Codecs (bzip2, lz4, snappy, zlib)
* Native IO utilities for [HDFS Short-Circuit Local Reads](../hadoop-hdfs/ShortCircuitLocalReads.html) and [Centralized Cache Management in HDFS](../hadoop-hdfs/CentralizedCacheManagement.html)
* CRC32 checksum implementation
Supported Platforms
-------------------
The native hadoop library is supported on \*nix platforms only. The library does not work with Cygwin or the Mac OS X platform.
The native hadoop library is mainly used on the GNU/Linux platform and has been tested on these distributions:
* RHEL4/Fedora
* Ubuntu
* Gentoo
On all the above distributions a 32/64 bit native hadoop library will work with a respective 32/64 bit jvm.
Download
--------
The pre-built 32-bit i386-Linux native hadoop library is available as part of the hadoop distribution and is located in the `lib/native` directory. You can download the hadoop distribution from Hadoop Common Releases.
Be sure to install the zlib and/or gzip development packages - whichever compression codecs you want to use with your deployment.
Build
-----
The native hadoop library is written in ANSI C and is built using the GNU autotools-chain (autoconf, autoheader, automake, autoscan, libtool). This means it should be straight-forward to build the library on any platform with a standards-compliant C compiler and the GNU autotools-chain (see the supported platforms).
The packages you need to install on the target platform are:
* C compiler (e.g. GNU C Compiler)
* GNU Autotools Chain: autoconf, automake, libtool
* zlib-development package (stable version \>= 1.2.0)
* openssl-development package (e.g. libssl-dev)
Once you have installed the prerequisite packages, use the standard hadoop pom.xml file and pass along the native flag to build the native hadoop library:
$ mvn package -Pdist,native -DskipTests -Dtar
You should see the newly-built library in:
$ hadoop-dist/target/hadoop-${project.version}/lib/native
Please note the following:
* It is mandatory to install both the zlib and gzip development packages on the target platform in order to build the native hadoop library; however, for deployment it is sufficient to install just one package if you wish to use only one codec.
* It is necessary to have the correct 32/64 libraries for zlib, depending on the 32/64 bit jvm for the target platform, in order to build and deploy the native hadoop library.
Runtime
-------
The bin/hadoop script ensures that the native hadoop library is on the library path via the system property: `-Djava.library.path=<path>`
During runtime, check the hadoop log files for your MapReduce tasks.
* If everything is all right, then: `DEBUG util.NativeCodeLoader - Trying to load the custom-built native-hadoop library...` `INFO util.NativeCodeLoader - Loaded the native-hadoop library`
* If something goes wrong, then: `INFO util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable`
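Note that the `DEBUG` line above only appears when debug logging is enabled. A minimal sketch of turning it on for a single client command (using the standard `HADOOP_ROOT_LOGGER` environment variable; the command itself is just an example):

```bash
# Raise the root log level to DEBUG on the console so the NativeCodeLoader
# messages become visible, then run any client command.
export HADOOP_ROOT_LOGGER=DEBUG,console
hadoop fs -ls /
```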
Check
-----
NativeLibraryChecker is a tool to check whether native libraries are loaded correctly. You can launch NativeLibraryChecker as follows:
$ hadoop checknative -a
14/12/06 01:30:45 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version
14/12/06 01:30:45 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /home/ozawa/hadoop/lib/native/libhadoop.so.1.0.0
zlib: true /lib/x86_64-linux-gnu/libz.so.1
snappy: true /usr/lib/libsnappy.so.1
lz4: true revision:99
bzip2: false
Native Shared Libraries
-----------------------
You can load any native shared library using DistributedCache for distributing and symlinking the library files.
This example shows you how to distribute a shared library, mylib.so, and load it from a MapReduce task.
1. First copy the library to the HDFS: `bin/hadoop fs -copyFromLocal mylib.so.1 /libraries/mylib.so.1`
2. The job launching program should contain the following: `DistributedCache.createSymlink(conf);` `DistributedCache.addCacheFile("hdfs://host:port/libraries/mylib.so.1#mylib.so", conf);`
3. The MapReduce task can contain: `System.loadLibrary("mylib.so");`
Note: If you downloaded or built the native hadoop library, you don't need to use DistributedCache to make the library available to your MapReduce tasks.


@ -0,0 +1,104 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
* [Rack Awareness](#Rack_Awareness)
* [python Example](#python_Example)
* [bash Example](#bash_Example)
Rack Awareness
==============
Hadoop components are rack-aware. For example, HDFS block placement will use rack awareness for fault tolerance by placing one block replica on a different rack. This provides data availability in the event of a network switch failure or partition within the cluster.
Hadoop master daemons obtain the rack id of the cluster slaves by invoking either an external script or java class as specified by configuration files. Using either the java class or external script for topology, output must adhere to the java **org.apache.hadoop.net.DNSToSwitchMapping** interface. The interface expects a one-to-one correspondence to be maintained and the topology information in the format of '/myrack/myhost', where '/' is the topology delimiter, 'myrack' is the rack identifier, and 'myhost' is the individual host. Assuming a single /24 subnet per rack, one could use the format of '/192.168.100.0/192.168.100.5' as a unique rack-host topology mapping.
To use the java class for topology mapping, the class name is specified by the **topology.node.switch.mapping.impl** parameter in the configuration file. An example, NetworkTopology.java, is included with the hadoop distribution and can be customized by the Hadoop administrator. Using a Java class instead of an external script has a performance benefit in that Hadoop doesn't need to fork an external process when a new slave node registers itself.
If implementing an external script, it will be specified with the **topology.script.file.name** parameter in the configuration files. Unlike the java class, the external topology script is not included with the Hadoop distribution and is provided by the administrator. Hadoop will send multiple IP addresses to ARGV when forking the topology script. The number of IP addresses sent to the topology script is controlled with **net.topology.script.number.args** and defaults to 100. If **net.topology.script.number.args** was changed to 1, a topology script would get forked for each IP submitted by DataNodes and/or NodeManagers.
If **topology.script.file.name** or **topology.node.switch.mapping.impl** is not set, the rack id '/default-rack' is returned for any passed IP address. While this behavior appears desirable, it can cause issues with HDFS block replication, as the default behavior is to write one replicated block off rack, which is not possible when there is only a single rack named '/default-rack'.
An additional configuration setting is **mapreduce.jobtracker.taskcache.levels** which determines the number of levels (in the network topology) of caches MapReduce will use. So, for example, if it is the default value of 2, two levels of caches will be constructed - one for hosts (host -\> task mapping) and another for racks (rack -\> task mapping), giving us our one-to-one mapping of '/myrack/myhost'.
python Example
--------------
```python
#!/usr/bin/python
# this script makes assumptions about the physical environment.
# 1) each rack is its own layer 3 network with a /24 subnet, which
# could be typical where each rack has its own
# switch with uplinks to a central core router.
#
# +-----------+
# |core router|
# +-----------+
# / \
# +-----------+ +-----------+
# |rack switch| |rack switch|
# +-----------+ +-----------+
# | data node | | data node |
# +-----------+ +-----------+
# | data node | | data node |
# +-----------+ +-----------+
#
# 2) topology script gets list of IP's as input, calculates network address, and prints '/network_address/ip'.
import netaddr
import sys
sys.argv.pop(0) # discard name of topology script from argv list as we just want IP addresses
netmask = '255.255.255.0' # set netmask to what's being used in your environment. The example uses a /24
for ip in sys.argv:                                              # loop over list of datanode IP's
    address = '{0}/{1}'.format(ip, netmask)                      # format address string so it looks like 'ip/netmask' to make netaddr work
    try:
        network_address = netaddr.IPNetwork(address).network     # calculate and print network address
        print "/{0}".format(network_address)
    except:
        print "/rack-unknown"                                    # print catch-all value if unable to calculate network address
```
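Before wiring a script like this into the topology configuration, it can be exercised by hand; the script name and IP addresses below are placeholders, and Python 2 plus the `netaddr` module are assumed to be installed:

```bash
# Run the topology script directly with a couple of sample DataNode IPs;
# each line of output should be a '/network_address' rack id.
python topology.py 192.168.100.5 192.168.101.7
# expected output for /24 racks:
#   /192.168.100.0
#   /192.168.101.0
```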
bash Example
------------
```bash
#!/bin/bash
# Here's a bash example to show just how simple these scripts can be
# Assuming we have flat network with everything on a single switch, we can fake a rack topology.
# This could occur in a lab environment where we have limited nodes, like 2-8 physical machines on an unmanaged switch.
# This may also apply to multiple virtual machines running on the same physical hardware.
# The number of machines isn't important, but that we are trying to fake a network topology when there isn't one.
#
# +----------+ +--------+
# |jobtracker| |datanode|
# +----------+ +--------+
# \ /
# +--------+ +--------+ +--------+
# |datanode|--| switch |--|datanode|
# +--------+ +--------+ +--------+
# / \
# +--------+ +--------+
# |datanode| |namenode|
# +--------+ +--------+
#
# With this network topology, we are treating each host as a rack. This is being done by taking the last octet
# in the datanode's IP and prepending it with the word '/rack-'. The advantage for doing this is so HDFS
# can create its 'off-rack' block copy.
# 1) 'echo $@' will echo all ARGV values to xargs.
# 2) 'xargs' will enforce that we print a single argv value per line
# 3) 'awk' will split fields on dots and append the last field to the string '/rack-'. If awk
# fails to split on four dots, it will still print '/rack-' last field value
echo $@ | xargs -n 1 | awk -F '.' '{print "/rack-"$NF}'
```


@ -0,0 +1,377 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
* [Hadoop in Secure Mode](#Hadoop_in_Secure_Mode)
* [Introduction](#Introduction)
* [Authentication](#Authentication)
* [End User Accounts](#End_User_Accounts)
* [User Accounts for Hadoop Daemons](#User_Accounts_for_Hadoop_Daemons)
* [Kerberos principals for Hadoop Daemons and Users](#Kerberos_principals_for_Hadoop_Daemons_and_Users)
* [Mapping from Kerberos principal to OS user account](#Mapping_from_Kerberos_principal_to_OS_user_account)
* [Mapping from user to group](#Mapping_from_user_to_group)
* [Proxy user](#Proxy_user)
* [Secure DataNode](#Secure_DataNode)
* [Data confidentiality](#Data_confidentiality)
* [Data Encryption on RPC](#Data_Encryption_on_RPC)
* [Data Encryption on Block data transfer.](#Data_Encryption_on_Block_data_transfer.)
* [Data Encryption on HTTP](#Data_Encryption_on_HTTP)
* [Configuration](#Configuration)
* [Permissions for both HDFS and local fileSystem paths](#Permissions_for_both_HDFS_and_local_fileSystem_paths)
* [Common Configurations](#Common_Configurations)
* [NameNode](#NameNode)
* [Secondary NameNode](#Secondary_NameNode)
* [DataNode](#DataNode)
* [WebHDFS](#WebHDFS)
* [ResourceManager](#ResourceManager)
* [NodeManager](#NodeManager)
* [Configuration for WebAppProxy](#Configuration_for_WebAppProxy)
* [LinuxContainerExecutor](#LinuxContainerExecutor)
* [MapReduce JobHistory Server](#MapReduce_JobHistory_Server)
Hadoop in Secure Mode
=====================
Introduction
------------
This document describes how to configure authentication for Hadoop in secure mode.
By default, Hadoop runs in non-secure mode in which no actual authentication is required. By configuring Hadoop to run in secure mode, each user and service needs to be authenticated by Kerberos in order to use Hadoop services.
Security features of Hadoop consist of [authentication](#Authentication), [service level authorization](./ServiceLevelAuth.html), [authentication for Web consoles](./HttpAuthentication.html) and [data confidentiality](#Data_confidentiality).
Authentication
--------------
### End User Accounts
When service level authentication is turned on, end users using Hadoop in secure mode need to be authenticated by Kerberos. The simplest way to authenticate is to use the `kinit` command of Kerberos.
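For example (the principal below is a placeholder), obtaining a ticket and then verifying it looks like:

```bash
# Obtain a Kerberos ticket-granting ticket for an end user, then list it.
kinit alice@REALM.TLD
klist
```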
### User Accounts for Hadoop Daemons
Ensure that HDFS and YARN daemons run as different Unix users, e.g. `hdfs` and `yarn`. Also, ensure that the MapReduce JobHistory server runs as a different user, such as `mapred`.
It's recommended to have them share a Unix group, e.g. `hadoop`. See also "[Mapping from user to group](#Mapping_from_user_to_group)" for group management.
| User:Group | Daemons |
|:---- |:---- |
| hdfs:hadoop | NameNode, Secondary NameNode, JournalNode, DataNode |
| yarn:hadoop | ResourceManager, NodeManager |
| mapred:hadoop | MapReduce JobHistory Server |
### Kerberos principals for Hadoop Daemons and Users
Kerberos principals are required for running Hadoop service daemons in secure mode. Each service reads authentication information saved in a keytab file with appropriate permissions.
HTTP web consoles should be served by a principal different from the RPC one.
The subsections below show examples of credentials for Hadoop services.
#### HDFS
The NameNode keytab file, on the NameNode host, should look like the following:
$ klist -e -k -t /etc/security/keytab/nn.service.keytab
Keytab name: FILE:/etc/security/keytab/nn.service.keytab
KVNO Timestamp Principal
4 07/18/11 21:08:09 nn/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 nn/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 nn/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
The Secondary NameNode keytab file, on that host, should look like the following:
$ klist -e -k -t /etc/security/keytab/sn.service.keytab
Keytab name: FILE:/etc/security/keytab/sn.service.keytab
KVNO Timestamp Principal
4 07/18/11 21:08:09 sn/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 sn/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 sn/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
The DataNode keytab file, on each host, should look like the following:
$ klist -e -k -t /etc/security/keytab/dn.service.keytab
Keytab name: FILE:/etc/security/keytab/dn.service.keytab
KVNO Timestamp Principal
4 07/18/11 21:08:09 dn/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 dn/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 dn/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
#### YARN
The ResourceManager keytab file, on the ResourceManager host, should look like the following:
$ klist -e -k -t /etc/security/keytab/rm.service.keytab
Keytab name: FILE:/etc/security/keytab/rm.service.keytab
KVNO Timestamp Principal
4 07/18/11 21:08:09 rm/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 rm/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 rm/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
The NodeManager keytab file, on each host, should look like the following:
$ klist -e -k -t /etc/security/keytab/nm.service.keytab
Keytab name: FILE:/etc/security/keytab/nm.service.keytab
KVNO Timestamp Principal
4 07/18/11 21:08:09 nm/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 nm/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 nm/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
#### MapReduce JobHistory Server
The MapReduce JobHistory Server keytab file, on that host, should look like the following:
$ klist -e -k -t /etc/security/keytab/jhs.service.keytab
Keytab name: FILE:/etc/security/keytab/jhs.service.keytab
KVNO Timestamp Principal
4 07/18/11 21:08:09 jhs/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 jhs/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 jhs/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-256 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (AES-128 CTS mode with 96-bit SHA-1 HMAC)
4 07/18/11 21:08:09 host/full.qualified.domain.name@REALM.TLD (ArcFour with HMAC/md5)
### Mapping from Kerberos principal to OS user account
Hadoop maps Kerberos principal to OS user account using the rule specified by `hadoop.security.auth_to_local` which works in the same way as the `auth_to_local` in [Kerberos configuration file (krb5.conf)](http://web.mit.edu/Kerberos/krb5-latest/doc/admin/conf_files/krb5_conf.html). In addition, Hadoop `auth_to_local` mapping supports the **/L** flag that lowercases the returned name.
By default, it picks the first component of the principal name as the user name if the realm matches the `default_realm` (usually defined in /etc/krb5.conf). For example, `host/full.qualified.domain.name@REALM.TLD` is mapped to `host` by the default rule.
Custom rules can be tested using the `hadoop kerbname` command. This command allows one to specify a principal and apply Hadoop's current auth_to_local ruleset. The output is the identity that Hadoop will use.
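For example, to see how the rules map a service principal (the principal is the same illustrative one used above):

```bash
# Apply the configured auth_to_local rules to a sample principal and
# print the resulting OS user name.
hadoop kerbname host/full.qualified.domain.name@REALM.TLD
```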
### Mapping from user to group
Though files on HDFS are associated with an owner and a group, Hadoop does not have the definition of group by itself. Mapping from user to group is done by the OS or LDAP.
You can change the way of mapping by specifying the name of the mapping provider as the value of `hadoop.security.group.mapping`. See [HDFS Permissions Guide](../hadoop-hdfs/HdfsPermissionsGuide.html) for details.
In practice, you need to manage an SSO environment using Kerberos with LDAP for Hadoop in secure mode.
### Proxy user
Some products such as Apache Oozie which access the services of Hadoop on behalf of end users need to be able to impersonate end users. See [the doc of proxy user](./Superusers.html) for details.
### Secure DataNode
Because the data transfer protocol of DataNode does not use the RPC framework of Hadoop, DataNode must authenticate itself by using privileged ports which are specified by `dfs.datanode.address` and `dfs.datanode.http.address`. This authentication is based on the assumption that the attacker won't be able to get root privileges.
When you execute the `hdfs datanode` command as root, the server process binds the privileged ports first, then drops privileges and runs as the user account specified by `HADOOP_SECURE_DN_USER`. This startup process uses jsvc installed to `JSVC_HOME`. You must specify `HADOOP_SECURE_DN_USER` and `JSVC_HOME` as environment variables on start up (in hadoop-env.sh).
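A minimal sketch of the corresponding hadoop-env.sh settings (the user name and jsvc location are assumptions for your environment):

```bash
# In hadoop-env.sh: after binding the privileged ports as root via jsvc,
# the DataNode process drops to this unprivileged user.
export HADOOP_SECURE_DN_USER=hdfs
export JSVC_HOME=/usr/bin
```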
As of version 2.6.0, SASL can be used to authenticate the data transfer protocol. In this configuration, it is no longer required for secured clusters to start the DataNode as root using jsvc and bind to privileged ports. To enable SASL on data transfer protocol, set `dfs.data.transfer.protection` in hdfs-site.xml, set a non-privileged port for `dfs.datanode.address`, set `dfs.http.policy` to *HTTPS\_ONLY* and make sure the `HADOOP_SECURE_DN_USER` environment variable is not defined. Note that it is not possible to use SASL on data transfer protocol if `dfs.datanode.address` is set to a privileged port. This is required for backwards-compatibility reasons.
In order to migrate an existing cluster that used root authentication to start using SASL instead, first ensure that version 2.6.0 or later has been deployed to all cluster nodes as well as any external applications that need to connect to the cluster. Only versions 2.6.0 and later of the HDFS client can connect to a DataNode that uses SASL for authentication of data transfer protocol, so it is vital that all callers have the correct version before migrating. After version 2.6.0 or later has been deployed everywhere, update configuration of any external applications to enable SASL. If an HDFS client is enabled for SASL, then it can connect successfully to a DataNode running with either root authentication or SASL authentication. Changing configuration for all clients guarantees that subsequent configuration changes on DataNodes will not disrupt the applications. Finally, each individual DataNode can be migrated by changing its configuration and restarting. It is acceptable to have a mix of some DataNodes running with root authentication and some DataNodes running with SASL authentication temporarily during this migration period, because an HDFS client enabled for SASL can connect to both.
Data confidentiality
--------------------
### Data Encryption on RPC
The data transferred between Hadoop services and clients can be encrypted on the wire. Setting `hadoop.rpc.protection` to `"privacy"` in core-site.xml activates data encryption.
### Data Encryption on Block data transfer.
You need to set `dfs.encrypt.data.transfer` to `"true"` in hdfs-site.xml in order to activate data encryption for the data transfer protocol of the DataNode.
Optionally, you may set `dfs.encrypt.data.transfer.algorithm` to either "3des" or "rc4" to choose the specific encryption algorithm. If unspecified, then the configured JCE default on the system is used, which is usually 3DES.
Setting `dfs.encrypt.data.transfer.cipher.suites` to `AES/CTR/NoPadding` activates AES encryption. By default, this is unspecified, so AES is not used. When AES is used, the algorithm specified in `dfs.encrypt.data.transfer.algorithm` is still used during an initial key exchange. The AES key bit length can be configured by setting `dfs.encrypt.data.transfer.cipher.key.bitlength` to 128, 192 or 256. The default is 128.
AES offers the greatest cryptographic strength and the best performance. At this time, 3DES and RC4 have been used more often in Hadoop clusters.
### Data Encryption on HTTP
Data transfer between web consoles and clients is protected by using SSL (HTTPS).
Configuration
-------------
### Permissions for both HDFS and local fileSystem paths
The following table lists various paths on HDFS and local filesystems (on all nodes) and recommended permissions:
| Filesystem | Path | User:Group | Permissions |
|:---- |:---- |:---- |:---- |
| local | `dfs.namenode.name.dir` | hdfs:hadoop | `drwx------` |
| local | `dfs.datanode.data.dir` | hdfs:hadoop | `drwx------` |
| local | $HADOOP\_LOG\_DIR | hdfs:hadoop | `drwxrwxr-x` |
| local | $YARN\_LOG\_DIR | yarn:hadoop | `drwxrwxr-x` |
| local | `yarn.nodemanager.local-dirs` | yarn:hadoop | `drwxr-xr-x` |
| local | `yarn.nodemanager.log-dirs` | yarn:hadoop | `drwxr-xr-x` |
| local | container-executor | root:hadoop | `--Sr-s--*` |
| local | `conf/container-executor.cfg` | root:hadoop | `r-------*` |
| hdfs | / | hdfs:hadoop | `drwxr-xr-x` |
| hdfs | /tmp | hdfs:hadoop | `drwxrwxrwxt` |
| hdfs | /user | hdfs:hadoop | `drwxr-xr-x` |
| hdfs | `yarn.nodemanager.remote-app-log-dir` | yarn:hadoop | `drwxrwxrwxt` |
| hdfs | `mapreduce.jobhistory.intermediate-done-dir` | mapred:hadoop | `drwxrwxrwxt` |
| hdfs | `mapreduce.jobhistory.done-dir` | mapred:hadoop | `drwxr-x---` |
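As a hedged sketch of applying a few of the rows above (the local directory path is a placeholder; substitute the values of `dfs.namenode.name.dir` and the other configured paths from your deployment):

```bash
# Local NameNode metadata directory (dfs.namenode.name.dir), owned by hdfs:hadoop.
chown -R hdfs:hadoop /data/dfs/nn
chmod 700 /data/dfs/nn

# HDFS directories, created and adjusted as the HDFS superuser.
hadoop fs -mkdir -p /tmp /user
hadoop fs -chown hdfs:hadoop /tmp /user
hadoop fs -chmod 1777 /tmp
hadoop fs -chmod 755 /user
```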
### Common Configurations
In order to turn on RPC authentication in hadoop, set the value of `hadoop.security.authentication` property to `"kerberos"`, and set security related settings listed below appropriately.
The following properties should be in the `core-site.xml` of all the nodes in the cluster.
| Parameter | Value | Notes |
|:---- |:---- |:---- |
| `hadoop.security.authentication` | *kerberos* | `simple` : No authentication. (default)  `kerberos` : Enable authentication by Kerberos. |
| `hadoop.security.authorization` | *true* | Enable [RPC service-level authorization](./ServiceLevelAuth.html). |
| `hadoop.rpc.protection` | *authentication* | *authentication* : authentication only (default)  *integrity* : integrity check in addition to authentication  *privacy* : data encryption in addition to integrity |
| `hadoop.security.auth_to_local` | `RULE:`*exp1* `RULE:`*exp2* *...* DEFAULT | The value is string containing new line characters. See [Kerberos documentation](http://web.mit.edu/Kerberos/krb5-latest/doc/admin/conf_files/krb5_conf.html) for format for *exp*. |
| `hadoop.proxyuser.`*superuser*`.hosts` | | Comma-separated list of hosts from which the *superuser* is allowed to perform impersonation. `*` means wildcard. |
| `hadoop.proxyuser.`*superuser*`.groups` | | Comma-separated list of groups to which users impersonated by the *superuser* belong. `*` means wildcard. |
### NameNode
| Parameter | Value | Notes |
|:---- |:---- |:---- |
| `dfs.block.access.token.enable` | *true* | Enable HDFS block access tokens for secure operations. |
| `dfs.https.enable` | *true* | This value is deprecated. Use dfs.http.policy |
| `dfs.http.policy` | *HTTP\_ONLY* or *HTTPS\_ONLY* or *HTTP\_AND\_HTTPS* | HTTPS\_ONLY turns off http access. This option takes precedence over the deprecated configuration dfs.https.enable and hadoop.ssl.enabled. If using SASL to authenticate data transfer protocol instead of running DataNode as root and using privileged ports, then this property must be set to *HTTPS\_ONLY* to guarantee authentication of HTTP servers. (See `dfs.data.transfer.protection`.) |
| `dfs.namenode.https-address` | *nn\_host\_fqdn:50470* | |
| `dfs.https.port` | *50470* | |
| `dfs.namenode.keytab.file` | */etc/security/keytab/nn.service.keytab* | Kerberos keytab file for the NameNode. |
| `dfs.namenode.kerberos.principal` | nn/\_HOST@REALM.TLD | Kerberos principal name for the NameNode. |
| `dfs.namenode.kerberos.internal.spnego.principal` | HTTP/\_HOST@REALM.TLD | HTTP Kerberos principal name for the NameNode. |
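For example, the NameNode settings above might be expressed in `hdfs-site.xml` as in this sketch; the keytab path and realm are placeholders.
<property>
  <name>dfs.block.access.token.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.http.policy</name>
  <value>HTTPS_ONLY</value>
</property>
<property>
  <name>dfs.namenode.keytab.file</name>
  <value>/etc/security/keytab/nn.service.keytab</value>
</property>
<property>
  <name>dfs.namenode.kerberos.principal</name>
  <value>nn/_HOST@REALM.TLD</value>
</property>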
### Secondary NameNode
| Parameter | Value | Notes |
|:---- |:---- |:---- |
| `dfs.namenode.secondary.http-address` | *c\_nn\_host\_fqdn:50090* | |
| `dfs.namenode.secondary.https-port` | *50470* | |
| `dfs.secondary.namenode.keytab.file` | */etc/security/keytab/sn.service.keytab* | Kerberos keytab file for the Secondary NameNode. |
| `dfs.secondary.namenode.kerberos.principal` | sn/\_HOST@REALM.TLD | Kerberos principal name for the Secondary NameNode. |
| `dfs.secondary.namenode.kerberos.internal.spnego.principal` | HTTP/\_HOST@REALM.TLD | HTTP Kerberos principal name for the Secondary NameNode. |
### DataNode
| Parameter | Value | Notes |
|:---- |:---- |:---- |
| `dfs.datanode.data.dir.perm` | 700 | |
| `dfs.datanode.address` | *0.0.0.0:1004* | Secure DataNode must use privileged port in order to assure that the server was started securely. This means that the server must be started via jsvc. Alternatively, this must be set to a non-privileged port if using SASL to authenticate data transfer protocol. (See `dfs.data.transfer.protection`.) |
| `dfs.datanode.http.address` | *0.0.0.0:1006* | Secure DataNode must use privileged port in order to assure that the server was started securely. This means that the server must be started via jsvc. |
| `dfs.datanode.https.address` | *0.0.0.0:50470* | |
| `dfs.datanode.keytab.file` | */etc/security/keytab/dn.service.keytab* | Kerberos keytab file for the DataNode. |
| `dfs.datanode.kerberos.principal` | dn/\_HOST@REALM.TLD | Kerberos principal name for the DataNode. |
| `dfs.encrypt.data.transfer` | *false* | set to `true` when using data encryption |
| `dfs.encrypt.data.transfer.algorithm` | | optionally set to `3des` or `rc4` when using data encryption to control encryption algorithm |
| `dfs.encrypt.data.transfer.cipher.suites` | | optionally set to `AES/CTR/NoPadding` to activate AES encryption when using data encryption |
| `dfs.encrypt.data.transfer.cipher.key.bitlength` | | optionally set to `128`, `192` or `256` to control key bit length when using AES with data encryption |
| `dfs.data.transfer.protection` | | *authentication* : authentication only  *integrity* : integrity check in addition to authentication  *privacy* : data encryption in addition to integrity This property is unspecified by default. Setting this property enables SASL for authentication of data transfer protocol. If this is enabled, then `dfs.datanode.address` must use a non-privileged port, `dfs.http.policy` must be set to *HTTPS\_ONLY* and the `HADOOP_SECURE_DN_USER` environment variable must be undefined when starting the DataNode process. |
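As a sketch of the SASL-based alternative described in the notes above, the following `hdfs-site.xml` fragment uses non-privileged ports together with `dfs.data.transfer.protection`; the port numbers and protection level are examples, not recommendations.
<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:10019</value>
</property>
<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:10022</value>
</property>
<property>
  <name>dfs.http.policy</name>
  <value>HTTPS_ONLY</value>
</property>
<property>
  <name>dfs.data.transfer.protection</name>
  <value>integrity</value>
</property>
Remember that with this configuration the `HADOOP_SECURE_DN_USER` environment variable must be left undefined when starting the DataNode.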
### WebHDFS
| Parameter | Value | Notes |
|:---- |:---- |:---- |
| `dfs.web.authentication.kerberos.principal` | http/\_HOST@REALM.TLD | Kerberos principal name for WebHDFS. |
| `dfs.web.authentication.kerberos.keytab` | */etc/security/keytab/http.service.keytab* | Kerberos keytab file for WebHDFS. |
### ResourceManager
| Parameter | Value | Notes |
|:---- |:---- |:---- |
| `yarn.resourcemanager.keytab` | */etc/security/keytab/rm.service.keytab* | Kerberos keytab file for the ResourceManager. |
| `yarn.resourcemanager.principal` | rm/\_HOST@REALM.TLD | Kerberos principal name for the ResourceManager. |
### NodeManager
| Parameter | Value | Notes |
|:---- |:---- |:---- |
| `yarn.nodemanager.keytab` | */etc/security/keytab/nm.service.keytab* | Kerberos keytab file for the NodeManager. |
| `yarn.nodemanager.principal` | nm/\_HOST@REALM.TLD | Kerberos principal name for the NodeManager. |
| `yarn.nodemanager.container-executor.class` | `org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor` | Use LinuxContainerExecutor. |
| `yarn.nodemanager.linux-container-executor.group` | *hadoop* | Unix group of the NodeManager. |
| `yarn.nodemanager.linux-container-executor.path` | */path/to/bin/container-executor* | The path to the executable of Linux container executor. |
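Put together, the NodeManager settings above might appear in `yarn-site.xml` as in the following sketch; the keytab path, realm and group name are placeholders.
<property>
  <name>yarn.nodemanager.keytab</name>
  <value>/etc/security/keytab/nm.service.keytab</value>
</property>
<property>
  <name>yarn.nodemanager.principal</name>
  <value>nm/_HOST@REALM.TLD</value>
</property>
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.group</name>
  <value>hadoop</value>
</property>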
### Configuration for WebAppProxy
The `WebAppProxy` provides a proxy between the web applications exported by an application and an end user. If security is enabled it will warn users before accessing a potentially unsafe web application. Authentication and authorization using the proxy is handled just like any other privileged web application.
| Parameter | Value | Notes |
|:---- |:---- |:---- |
| `yarn.web-proxy.address` | `WebAppProxy` host:port for proxy to AM web apps. | *host:port*. If this is the same as `yarn.resourcemanager.webapp.address` or is not defined, then the `ResourceManager` will run the proxy; otherwise a standalone proxy server will need to be launched. |
| `yarn.web-proxy.keytab` | */etc/security/keytab/web-app.service.keytab* | Kerberos keytab file for the WebAppProxy. |
| `yarn.web-proxy.principal` | wap/\_HOST@REALM.TLD | Kerberos principal name for the WebAppProxy. |
### LinuxContainerExecutor
A `ContainerExecutor` is used by the YARN framework to define how *containers* are launched and controlled.
The following executors are available in Hadoop YARN:
| ContainerExecutor | Description |
|:---- |:---- |
| `DefaultContainerExecutor` | The default executor which YARN uses to manage container execution. The container process has the same Unix user as the NodeManager. |
| `LinuxContainerExecutor` | Supported only on GNU/Linux, this executor runs the containers as either the YARN user who submitted the application (when full security is enabled) or as a dedicated user (defaults to nobody) when full security is not enabled. When full security is enabled, this executor requires all user accounts to be created on the cluster nodes where the containers are launched. It uses a *setuid* executable that is included in the Hadoop distribution. The NodeManager uses this executable to launch and kill containers. The setuid executable switches to the user who has submitted the application and launches or kills the containers. For maximum security, this executor sets up restricted permissions and user/group ownership of local files and directories used by the containers such as the shared objects, jars, intermediate files, log files etc. Particularly note that, because of this, except the application owner and NodeManager, no other user can access any of the local files/directories including those localized as part of the distributed cache. |
To build the LinuxContainerExecutor executable run:
$ mvn package -Dcontainer-executor.conf.dir=/etc/hadoop/
The path passed in `-Dcontainer-executor.conf.dir` should be the path on the cluster nodes where a configuration file for the setuid executable should be located. The executable should be installed in $HADOOP\_YARN\_HOME/bin.
The executable must have specific permissions: 6050 or `--Sr-s---` permissions user-owned by *root* (super-user) and group-owned by a special group (e.g. `hadoop`) of which the NodeManager Unix user is the group member and no ordinary application user is. If any application user belongs to this special group, security will be compromised. This special group name should be specified for the configuration property `yarn.nodemanager.linux-container-executor.group` in both `conf/yarn-site.xml` and `conf/container-executor.cfg`.
For example, let's say that the NodeManager is run as user *yarn* who is part of the groups users and *hadoop*, any of them being the primary group. Let also be that *users* has both *yarn* and another user (application submitter) *alice* as its members, and *alice* does not belong to *hadoop*. Going by the above description, the setuid/setgid executable should be set 6050 or `--Sr-s---` with user-owner as *yarn* and group-owner as *hadoop* which has *yarn* as its member (and not *users* which has *alice* also as its member besides *yarn*).
The `LinuxContainerExecutor` requires that the paths including and leading up to the directories specified in `yarn.nodemanager.local-dirs` and `yarn.nodemanager.log-dirs` be set to 755 permissions, as described above in the table on directory permissions.
* `conf/container-executor.cfg`
The executable requires a configuration file called `container-executor.cfg` to be present in the configuration directory passed to the mvn target mentioned above.
The configuration file must be owned by the user running NodeManager (user `yarn` in the above example), group-owned by anyone and should have the permissions 0400 or `r--------` .
The executable requires following configuration items to be present in the `conf/container-executor.cfg` file. The items should be mentioned as simple key=value pairs, one per-line:
| Parameter | Value | Notes |
|:---- |:---- |:---- |
| `yarn.nodemanager.linux-container-executor.group` | *hadoop* | Unix group of the NodeManager. The group owner of the *container-executor* binary should be this group. Should be same as the value with which the NodeManager is configured. This configuration is required for validating the secure access of the *container-executor* binary. |
| `banned.users` | hdfs,yarn,mapred,bin | Banned users. |
| `allowed.system.users` | foo,bar | Allowed system users. |
| `min.user.id` | 1000 | Prevent other super-users. |
To re-cap, here are the local file-system permissions required for the various paths related to the `LinuxContainerExecutor`:
| Filesystem | Path | User:Group | Permissions |
|:---- |:---- |:---- |:---- |
| local | container-executor | root:hadoop | `--Sr-s--*` |
| local | `conf/container-executor.cfg` | root:hadoop | `r-------*` |
| local | `yarn.nodemanager.local-dirs` | yarn:hadoop | `drwxr-xr-x` |
| local | `yarn.nodemanager.log-dirs` | yarn:hadoop | `drwxr-xr-x` |
### MapReduce JobHistory Server
| Parameter | Value | Notes |
|:---- |:---- |:---- |
| `mapreduce.jobhistory.address` | MapReduce JobHistory Server *host:port* | Default port is 10020. |
| `mapreduce.jobhistory.keytab` | */etc/security/keytab/jhs.service.keytab* | Kerberos keytab file for the MapReduce JobHistory Server. |
| `mapreduce.jobhistory.principal` | jhs/\_HOST@REALM.TLD | Kerberos principal name for the MapReduce JobHistory Server. |
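For completeness, a `mapred-site.xml` sketch for the JobHistory Server settings above; the host name, keytab path and realm are placeholders.
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>jhs.example.com:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.keytab</name>
  <value>/etc/security/keytab/jhs.service.keytab</value>
</property>
<property>
  <name>mapreduce.jobhistory.principal</name>
  <value>jhs/_HOST@REALM.TLD</value>
</property>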

View File

@ -0,0 +1,144 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Service Level Authorization Guide
=================================
* [Service Level Authorization Guide](#Service_Level_Authorization_Guide)
* [Purpose](#Purpose)
* [Prerequisites](#Prerequisites)
* [Overview](#Overview)
* [Configuration](#Configuration)
* [Enable Service Level Authorization](#Enable_Service_Level_Authorization)
* [Hadoop Services and Configuration Properties](#Hadoop_Services_and_Configuration_Properties)
* [Access Control Lists](#Access_Control_Lists)
* [Refreshing Service Level Authorization Configuration](#Refreshing_Service_Level_Authorization_Configuration)
* [Examples](#Examples)
Purpose
-------
This document describes how to configure and manage Service Level Authorization for Hadoop.
Prerequisites
-------------
Make sure Hadoop is installed, configured and setup correctly. For more information see:
* [Single Node Setup](./SingleCluster.html) for first-time users.
* [Cluster Setup](./ClusterSetup.html) for large, distributed clusters.
Overview
--------
Service Level Authorization is the initial authorization mechanism to ensure clients connecting to a particular Hadoop service have the necessary, pre-configured, permissions and are authorized to access the given service. For example, a MapReduce cluster can use this mechanism to allow a configured list of users/groups to submit jobs.
The `$HADOOP_CONF_DIR/hadoop-policy.xml` configuration file is used to define the access control lists for various Hadoop services.
Service Level Authorization is performed well before other access control checks such as file-permission checks, access control on job queues, etc.
Configuration
-------------
This section describes how to configure service-level authorization via the configuration file `$HADOOP_CONF_DIR/hadoop-policy.xml`.
### Enable Service Level Authorization
By default, service-level authorization is disabled for Hadoop. To enable it, set the configuration property `hadoop.security.authorization` to `true` in `$HADOOP_CONF_DIR/core-site.xml`.
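For example, the following `core-site.xml` snippet turns the feature on:
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>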
### Hadoop Services and Configuration Properties
This section lists the various Hadoop services and their configuration knobs:
| Property | Service |
|:---- |:---- |
| security.client.protocol.acl | ACL for ClientProtocol, which is used by user code via the DistributedFileSystem. |
| security.client.datanode.protocol.acl | ACL for ClientDatanodeProtocol, the client-to-datanode protocol for block recovery. |
| security.datanode.protocol.acl | ACL for DatanodeProtocol, which is used by datanodes to communicate with the namenode. |
| security.inter.datanode.protocol.acl | ACL for InterDatanodeProtocol, the inter-datanode protocol for updating generation timestamp. |
| security.namenode.protocol.acl | ACL for NamenodeProtocol, the protocol used by the secondary namenode to communicate with the namenode. |
| security.inter.tracker.protocol.acl | ACL for InterTrackerProtocol, used by the tasktrackers to communicate with the jobtracker. |
| security.job.submission.protocol.acl | ACL for JobSubmissionProtocol, used by job clients to communicate with the jobtracker for job submission, querying job status etc. |
| security.task.umbilical.protocol.acl | ACL for TaskUmbilicalProtocol, used by the map and reduce tasks to communicate with the parent tasktracker. |
| security.refresh.policy.protocol.acl | ACL for RefreshAuthorizationPolicyProtocol, used by the dfsadmin and mradmin commands to refresh the security policy in-effect. |
| security.ha.service.protocol.acl | ACL for HAService protocol used by HAAdmin to manage the active and stand-by states of namenode. |
### Access Control Lists
`$HADOOP_CONF_DIR/hadoop-policy.xml` defines an access control list for each Hadoop service. Every access control list has a simple format:
The list of users and the list of groups are both comma-separated lists of names. The two lists are separated by a space.
Example: `user1,user2 group1,group2`.
Add a blank at the beginning of the line if only a list of groups is to be provided; equivalently, a comma-separated list of users followed by a space or nothing implies only a set of given users.
A special value of `*` implies that all users are allowed to access the service.
If access control list is not defined for a service, the value of `security.service.authorization.default.acl` is applied. If `security.service.authorization.default.acl` is not defined, `*` is applied.
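For instance, to restrict every service that has no explicit ACL to a couple of administrative users plus an admin group, one might set the default ACL as in the following sketch (the user and group names are hypothetical):
<property>
  <name>security.service.authorization.default.acl</name>
  <value>hdfsadmin,yarnadmin hadoopadmins</value>
</property>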
* Blocked Access Control Lists

    In some cases, it is required to specify a blocked access control list for a service. This specifies the list of users and groups who are not authorized to access the service. The format of the blocked access control list is the same as that of the access control list. The blocked access control list can be specified via `$HADOOP_CONF_DIR/hadoop-policy.xml`. The property name is derived by suffixing with ".blocked".
Example: The property name of the blocked access control list for `security.client.protocol.acl` will be `security.client.protocol.acl.blocked`.
For a service, it is possible to specify both an access control list and a blocked access control list. A user is authorized to access the service if the user is in the access control list and not in the blocked access control list.
If the blocked access control list is not defined for a service, the value of `security.service.authorization.default.acl.blocked` is applied. If `security.service.authorization.default.acl.blocked` is not defined, an empty blocked access control list is applied.
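As an illustration, the following sketch keeps ClientProtocol open to all users while blocking two hypothetical accounts via the derived `.blocked` property:
<property>
  <name>security.client.protocol.acl</name>
  <value>*</value>
</property>
<property>
  <name>security.client.protocol.acl.blocked</name>
  <value>baduser1,baduser2</value>
</property>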
### Refreshing Service Level Authorization Configuration
The service-level authorization configuration for the NameNode and JobTracker can be changed without restarting either of the Hadoop master daemons. The cluster administrator can change `$HADOOP_CONF_DIR/hadoop-policy.xml` on the master nodes and instruct the NameNode and JobTracker to reload their respective configurations via the `-refreshServiceAcl` switch to `dfsadmin` and `mradmin` commands respectively.
Refresh the service-level authorization configuration for the NameNode:
$ bin/hadoop dfsadmin -refreshServiceAcl
Refresh the service-level authorization configuration for the JobTracker:
$ bin/hadoop mradmin -refreshServiceAcl
Of course, one can use the `security.refresh.policy.protocol.acl` property in `$HADOOP_CONF_DIR/hadoop-policy.xml` to restrict access to the ability to refresh the service-level authorization configuration to certain users/groups.
* Access Control using list of IP addresses, host names and IP ranges

    Access to a service can be controlled based on the IP address of the client accessing the service. It is possible to restrict access to a service from a set of machines by specifying a list of IP addresses, host names and IP ranges. The property name for each service is derived from the corresponding ACL's property name. If the property name of the ACL is `security.client.protocol.acl`, the property name for the hosts list will be `security.client.protocol.hosts`.
If the hosts list is not defined for a service, the value of `security.service.authorization.default.hosts` is applied. If `security.service.authorization.default.hosts` is not defined, `*` is applied.
It is possible to specify a blocked list of hosts. Only those machines which are in the hosts list, but not in the blocked hosts list, will be granted access to the service. The property name is derived by suffixing with ".blocked".
Example: The property name of the blocked hosts list for `security.client.protocol.hosts` will be `security.client.protocol.hosts.blocked`.
If the blocked hosts list is not defined for a service, the value of `security.service.authorization.default.hosts.blocked` is applied. If `security.service.authorization.default.hosts.blocked` is not defined, an empty blocked hosts list is applied.
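For example, the following sketch grants ClientProtocol access to a subnet while blocking one specific address; the addresses are placeholders:
<property>
  <name>security.client.protocol.hosts</name>
  <value>10.0.0.0/24</value>
</property>
<property>
  <name>security.client.protocol.hosts.blocked</name>
  <value>10.0.0.13</value>
</property>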
### Examples
Allow only users `alice`, `bob` and users in the `mapreduce` group to submit jobs to the MapReduce cluster:
<property>
<name>security.job.submission.protocol.acl</name>
<value>alice,bob mapreduce</value>
</property>
Allow only DataNodes running as the users who belong to the group `datanodes` to communicate with the NameNode:
<property>
<name>security.datanode.protocol.acl</name>
<value>datanodes</value>
</property>
Allow any user to talk to the HDFS cluster as a DFSClient:
<property>
<name>security.client.protocol.acl</name>
<value>*</value>
</property>

View File

@ -0,0 +1,231 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
#set ( $H3 = '###' )
#set ( $H4 = '####' )
#set ( $H5 = '#####' )
Hadoop: Setting up a Single Node Cluster.
=========================================
* [Hadoop: Setting up a Single Node Cluster.](#Hadoop:_Setting_up_a_Single_Node_Cluster.)
* [Purpose](#Purpose)
* [Prerequisites](#Prerequisites)
* [Supported Platforms](#Supported_Platforms)
* [Required Software](#Required_Software)
* [Installing Software](#Installing_Software)
* [Download](#Download)
* [Prepare to Start the Hadoop Cluster](#Prepare_to_Start_the_Hadoop_Cluster)
* [Standalone Operation](#Standalone_Operation)
* [Pseudo-Distributed Operation](#Pseudo-Distributed_Operation)
* [Configuration](#Configuration)
* [Setup passphraseless ssh](#Setup_passphraseless_ssh)
* [Execution](#Execution)
* [YARN on a Single Node](#YARN_on_a_Single_Node)
* [Fully-Distributed Operation](#Fully-Distributed_Operation)
Purpose
-------
This document describes how to set up and configure a single-node Hadoop installation so that you can quickly perform simple operations using Hadoop MapReduce and the Hadoop Distributed File System (HDFS).
Prerequisites
-------------
$H3 Supported Platforms
* GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
* Windows is also a supported platform but the following steps are for Linux only. To set up Hadoop on Windows, see [wiki page](http://wiki.apache.org/hadoop/Hadoop2OnWindows).
$H3 Required Software
The required software for Linux includes:
1. Java™ must be installed. Recommended Java versions are described at [HadoopJavaVersions](http://wiki.apache.org/hadoop/HadoopJavaVersions).
2. ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons.
$H3 Installing Software
If your cluster doesn't have the requisite software you will need to install it.
For example on Ubuntu Linux:
$ sudo apt-get install ssh
$ sudo apt-get install rsync
Download
--------
To get a Hadoop distribution, download a recent stable release from one of the [Apache Download Mirrors](http://www.apache.org/dyn/closer.cgi/hadoop/common/).
Prepare to Start the Hadoop Cluster
-----------------------------------
Unpack the downloaded Hadoop distribution. In the distribution, edit the file `etc/hadoop/hadoop-env.sh` to define some parameters as follows:
# set to the root of your Java installation
export JAVA_HOME=/usr/java/latest
Try the following command:
$ bin/hadoop
This will display the usage documentation for the hadoop script.
Now you are ready to start your Hadoop cluster in one of the three supported modes:
* [Local (Standalone) Mode](#Standalone_Operation)
* [Pseudo-Distributed Mode](#Pseudo-Distributed_Operation)
* [Fully-Distributed Mode](#Fully-Distributed_Operation)
Standalone Operation
--------------------
By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.
The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.
$ mkdir input
$ cp etc/hadoop/*.xml input
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-${project.version}.jar grep input output 'dfs[a-z.]+'
$ cat output/*
Pseudo-Distributed Operation
----------------------------
Hadoop can also be run on a single node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.
$H3 Configuration
Use the following:
etc/hadoop/core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
etc/hadoop/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
$H3 Setup passphraseless ssh
Now check that you can ssh to the localhost without a passphrase:
$ ssh localhost
If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
$ export HADOOP_PREFIX=/usr/local/hadoop
$H3 Execution
The following instructions are to run a MapReduce job locally. If you want to execute a job on YARN, see [YARN on a Single Node](#YARN_on_a_Single_Node).
1. Format the filesystem:
$ bin/hdfs namenode -format
2. Start NameNode daemon and DataNode daemon:
$ sbin/start-dfs.sh
The hadoop daemon log output is written to the `$HADOOP_LOG_DIR` directory (defaults to `$HADOOP_HOME/logs`).
3. Browse the web interface for the NameNode; by default it is available at:
* NameNode - `http://localhost:50070/`
4. Make the HDFS directories required to execute MapReduce jobs:
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/<username>
5. Copy the input files into the distributed filesystem:
$ bin/hdfs dfs -put etc/hadoop input
6. Run some of the examples provided:
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-${project.version}.jar grep input output 'dfs[a-z.]+'
7. Examine the output files: Copy the output files from the distributed filesystem to the local filesystem and examine them:
$ bin/hdfs dfs -get output output
$ cat output/*
or
View the output files on the distributed filesystem:
$ bin/hdfs dfs -cat output/*
8. When you're done, stop the daemons with:
$ sbin/stop-dfs.sh
$H3 YARN on a Single Node
You can run a MapReduce job on YARN in a pseudo-distributed mode by setting a few parameters and running ResourceManager daemon and NodeManager daemon in addition.
The following instructions assume that steps 1 through 4 of [the above instructions](#Execution) have already been executed.
1. Configure parameters as follows: `etc/hadoop/mapred-site.xml`:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
`etc/hadoop/yarn-site.xml`:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
2. Start ResourceManager daemon and NodeManager daemon:
$ sbin/start-yarn.sh
3. Browse the web interface for the ResourceManager; by default it is available at:
* ResourceManager - `http://localhost:8088/`
4. Run a MapReduce job.
5. When you're done, stop the daemons with:
$ sbin/stop-yarn.sh
Fully-Distributed Operation
---------------------------
For information on setting up fully-distributed, non-trivial clusters see [Cluster Setup](./ClusterSetup.html).

View File

@ -0,0 +1,20 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Single Node Setup
=================
This page will be removed in the next major release.
See [Single Cluster Setup](./SingleCluster.html) to set up and configure a single-node Hadoop installation.

View File

@ -0,0 +1,106 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Proxy user - Superusers Acting On Behalf Of Other Users
=======================================================
* [Proxy user - Superusers Acting On Behalf Of Other Users](#Proxy_user_-_Superusers_Acting_On_Behalf_Of_Other_Users)
* [Introduction](#Introduction)
* [Use Case](#Use_Case)
* [Code example](#Code_example)
* [Configurations](#Configurations)
* [Caveats](#Caveats)
Introduction
------------
This document describes how a superuser can submit jobs or access hdfs on behalf of another user.
Use Case
--------
The code example described in the next section is applicable for the following use case.
A superuser with username 'super' wants to submit a job and access HDFS on behalf of a user joe. The superuser has Kerberos credentials but user joe doesn't have any. The tasks are required to run as user joe and any file accesses on the namenode are required to be done as user joe. It is required that user joe can connect to the namenode or job tracker on a connection authenticated with super's Kerberos credentials. In other words, super is impersonating the user joe.
Some products such as Apache Oozie need this.
Code example
------------
In this example super's credentials are used for login and a proxy user ugi object is created for joe. The operations are performed within the doAs method of this proxy user ugi object.
...
//Create ugi for joe. The login user is 'super'.
UserGroupInformation ugi =
        UserGroupInformation.createProxyUser("joe", UserGroupInformation.getLoginUser());
ugi.doAs(new PrivilegedExceptionAction<Void>() {
  public Void run() throws Exception {
    //Submit a job
    JobClient jc = new JobClient(conf);
    jc.submitJob(conf);
    //OR access hdfs
    FileSystem fs = FileSystem.get(conf);
    fs.mkdirs(someFilePath);
    //PrivilegedExceptionAction<Void> must return a value; null is fine here
    return null;
  }
});
Configurations
--------------
You can configure proxy user using properties `hadoop.proxyuser.$superuser.hosts` along with either or both of `hadoop.proxyuser.$superuser.groups` and `hadoop.proxyuser.$superuser.users`.
By specifying as below in core-site.xml, the superuser named `super` can connect only from `host1` and `host2` to impersonate a user belonging to `group1` and `group2`.
<property>
<name>hadoop.proxyuser.super.hosts</name>
<value>host1,host2</value>
</property>
<property>
<name>hadoop.proxyuser.super.groups</name>
<value>group1,group2</value>
</property>
If these configurations are not present, impersonation will not be allowed and connection will fail.
If more lax security is preferred, the wildcard value \* may be used to allow impersonation from any host or of any user. For example, by specifying as below in core-site.xml, user named `oozie` accessing from any host can impersonate any user belonging to any group.
<property>
<name>hadoop.proxyuser.oozie.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.oozie.groups</name>
<value>*</value>
</property>
The `hadoop.proxyuser.$superuser.hosts` property accepts a list of IP addresses, IP address ranges in CIDR format and/or host names. For example, by specifying as below, the user named `super` accessing from hosts in the range `10.222.0.0-15` and `10.113.221.221` can impersonate `user1` and `user2`.
<property>
<name>hadoop.proxyuser.super.hosts</name>
<value>10.222.0.0/16,10.113.221.221</value>
</property>
<property>
<name>hadoop.proxyuser.super.users</name>
<value>user1,user2</value>
</property>
Caveats
-------
If the cluster is running in [Secure Mode](./SecureMode.html), the superuser must have kerberos credentials to be able to impersonate another user.
Delegation tokens cannot be used for this feature. It would be wrong if the superuser added its own delegation token to the proxy user ugi, as it would allow the proxy user to connect to the service with the privileges of the superuser.
However, if the superuser does want to give a delegation token to joe, it must first impersonate joe and get a delegation token for joe, in the same way as the code example above, and add it to the ugi of joe. In this way the delegation token will have the owner as joe.

View File

@ -0,0 +1,209 @@
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
Enabling Dapper-like Tracing in Hadoop
======================================
* [Enabling Dapper-like Tracing in Hadoop](#Enabling_Dapper-like_Tracing_in_Hadoop)
* [Dapper-like Tracing in Hadoop](#Dapper-like_Tracing_in_Hadoop)
* [HTrace](#HTrace)
* [Samplers](#Samplers)
* [SpanReceivers](#SpanReceivers)
* [Setting up ZipkinSpanReceiver](#Setting_up_ZipkinSpanReceiver)
* [Dynamic update of tracing configuration](#Dynamic_update_of_tracing_configuration)
* [Starting tracing spans by HTrace API](#Starting_tracing_spans_by_HTrace_API)
* [Sample code for tracing](#Sample_code_for_tracing)
Dapper-like Tracing in Hadoop
-----------------------------
### HTrace
[HDFS-5274](https://issues.apache.org/jira/browse/HDFS-5274) added support for tracing requests through HDFS,
using the open source tracing library,
[Apache HTrace](https://git-wip-us.apache.org/repos/asf/incubator-htrace.git).
Setting up tracing is quite simple; however, it requires some very minor changes to your client code.
### Samplers
Configure the samplers in `core-site.xml` property: `hadoop.htrace.sampler`.
The value can be NeverSampler, AlwaysSampler or ProbabilitySampler.
NeverSampler: HTrace is OFF for all spans;
AlwaysSampler: HTrace is ON for all spans;
ProbabilitySampler: HTrace is ON for a configurable percentage of top-level spans.
<property>
<name>hadoop.htrace.sampler</name>
<value>NeverSampler</value>
</property>
### SpanReceivers
The tracing system works by collecting information in structs called 'Spans'.
It is up to you to choose how you want to receive this information
by implementing the SpanReceiver interface, which defines one method:
public void receiveSpan(Span span);
Configure which SpanReceivers you'd like to use
by putting a comma-separated list of the fully-qualified class names of classes implementing SpanReceiver
in the `core-site.xml` property `hadoop.htrace.spanreceiver.classes`.
<property>
<name>hadoop.htrace.spanreceiver.classes</name>
<value>org.apache.htrace.impl.LocalFileSpanReceiver</value>
</property>
<property>
<name>hadoop.htrace.local-file-span-receiver.path</name>
<value>/var/log/hadoop/htrace.out</value>
</property>
You can omit the package name prefix if you use a span receiver bundled with HTrace.
<property>
<name>hadoop.htrace.spanreceiver.classes</name>
<value>LocalFileSpanReceiver</value>
</property>
### Setting up ZipkinSpanReceiver
Instead of implementing SpanReceiver by yourself,
you can use `ZipkinSpanReceiver` which uses
[Zipkin](https://github.com/twitter/zipkin) for collecting and displaying tracing data.
In order to use `ZipkinSpanReceiver`,
you need to download and set up [Zipkin](https://github.com/twitter/zipkin) first.
You also need to add the jar of `htrace-zipkin` to the classpath of Hadoop on each node.
Here is an example setup procedure.
$ git clone https://github.com/cloudera/htrace
$ cd htrace/htrace-zipkin
$ mvn compile assembly:single
$ cp target/htrace-zipkin-*-jar-with-dependencies.jar $HADOOP_HOME/share/hadoop/common/lib/
The sample configuration for `ZipkinSpanReceiver` is shown below.
By adding these to `core-site.xml` of the NameNode and DataNodes, `ZipkinSpanReceiver` is initialized at startup.
You also need this configuration on the client node in addition to the servers.
<property>
<name>hadoop.htrace.spanreceiver.classes</name>
<value>ZipkinSpanReceiver</value>
</property>
<property>
<name>hadoop.htrace.zipkin.collector-hostname</name>
<value>192.168.1.2</value>
</property>
<property>
<name>hadoop.htrace.zipkin.collector-port</name>
<value>9410</value>
</property>
### Dynamic update of tracing configuration
You can use the `hadoop trace` command to see and update the tracing configuration of each server.
You must specify the IPC server address of the namenode or datanode with the `-host` option.
You need to run the command against all servers if you want to update the configuration of all servers.
`hadoop trace -list` shows the list of loaded span receivers, each associated with an id.
$ hadoop trace -list -host 192.168.56.2:9000
ID CLASS
1 org.apache.htrace.impl.LocalFileSpanReceiver
$ hadoop trace -list -host 192.168.56.2:50020
ID CLASS
1 org.apache.htrace.impl.LocalFileSpanReceiver
`hadoop trace -remove` removes a span receiver from a server.
The `-remove` option takes the id of the span receiver as its argument.
$ hadoop trace -remove 1 -host 192.168.56.2:9000
Removed trace span receiver 1
`hadoop trace -add` adds a span receiver to a server.
You need to specify the class name of the span receiver as the argument of the `-class` option.
You can specify the configuration associated with the span receiver with `-Ckey=value` options.
$ hadoop trace -add -class LocalFileSpanReceiver -Chadoop.htrace.local-file-span-receiver.path=/tmp/htrace.out -host 192.168.56.2:9000
Added trace span receiver 2 with configuration hadoop.htrace.local-file-span-receiver.path = /tmp/htrace.out
$ hadoop trace -list -host 192.168.56.2:9000
ID CLASS
2 org.apache.htrace.impl.LocalFileSpanReceiver
### Starting tracing spans by HTrace API
In order to trace, you will need to wrap the traced logic with a **tracing span** as shown below.
When there are running tracing spans,
the tracing information is propagated to the servers along with the RPC requests.
In addition, you need to initialize `SpanReceiver` once per process.
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.tracing.SpanReceiverHost;
import org.apache.htrace.Sampler;
import org.apache.htrace.Trace;
import org.apache.htrace.TraceScope;

...

SpanReceiverHost.getInstance(new HdfsConfiguration());

...

TraceScope ts = Trace.startSpan("Gets", Sampler.ALWAYS);
try {
  ... // traced logic
} finally {
  if (ts != null) ts.close();
}
### Sample code for tracing
The `TracingFsShell.java` shown below is a wrapper around FsShell
which starts a tracing span before invoking an HDFS shell command.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.tracing.SpanReceiverHost;
import org.apache.hadoop.util.ToolRunner;
import org.apache.htrace.Sampler;
import org.apache.htrace.Trace;
import org.apache.htrace.TraceScope;

public class TracingFsShell {
  public static void main(String argv[]) throws Exception {
    Configuration conf = new Configuration();
    FsShell shell = new FsShell();
    conf.setQuietMode(false);
    shell.setConf(conf);
    SpanReceiverHost.getInstance(conf);
    int res = 0;
    TraceScope ts = null;
    try {
      ts = Trace.startSpan("FsShell", Sampler.ALWAYS);
      res = ToolRunner.run(shell, argv);
    } finally {
      shell.close();
      if (ts != null) ts.close();
    }
    System.exit(res);
  }
}
You can compile and execute this code as shown below.
$ javac -cp `hadoop classpath` TracingFsShell.java
$ java -cp .:`hadoop classpath` TracingFsShell -ls /

View File

@ -53,6 +53,7 @@
<item name="Hadoop Commands Reference" href="hadoop-project-dist/hadoop-common/CommandsManual.html"/> <item name="Hadoop Commands Reference" href="hadoop-project-dist/hadoop-common/CommandsManual.html"/>
<item name="FileSystem Shell" href="hadoop-project-dist/hadoop-common/FileSystemShell.html"/> <item name="FileSystem Shell" href="hadoop-project-dist/hadoop-common/FileSystemShell.html"/>
<item name="Hadoop Compatibility" href="hadoop-project-dist/hadoop-common/Compatibility.html"/> <item name="Hadoop Compatibility" href="hadoop-project-dist/hadoop-common/Compatibility.html"/>
<item name="Interface Classification" href="hadoop-project-dist/hadoop-common/InterfaceClassification.html"/>
<item name="FileSystem Specification" <item name="FileSystem Specification"
href="hadoop-project-dist/hadoop-common/filesystem/index.html"/> href="hadoop-project-dist/hadoop-common/filesystem/index.html"/>
</menu> </menu>
@ -61,6 +62,7 @@
<item name="CLI Mini Cluster" href="hadoop-project-dist/hadoop-common/CLIMiniCluster.html"/> <item name="CLI Mini Cluster" href="hadoop-project-dist/hadoop-common/CLIMiniCluster.html"/>
<item name="Native Libraries" href="hadoop-project-dist/hadoop-common/NativeLibraries.html"/> <item name="Native Libraries" href="hadoop-project-dist/hadoop-common/NativeLibraries.html"/>
<item name="Proxy User" href="hadoop-project-dist/hadoop-common/Superusers.html"/> <item name="Proxy User" href="hadoop-project-dist/hadoop-common/Superusers.html"/>
<item name="Rack Awareness" href="hadoop-project-dist/hadoop-common/RackAwareness.html"/>
<item name="Secure Mode" href="hadoop-project-dist/hadoop-common/SecureMode.html"/> <item name="Secure Mode" href="hadoop-project-dist/hadoop-common/SecureMode.html"/>
<item name="Service Level Authorization" href="hadoop-project-dist/hadoop-common/ServiceLevelAuth.html"/> <item name="Service Level Authorization" href="hadoop-project-dist/hadoop-common/ServiceLevelAuth.html"/>
<item name="HTTP Authentication" href="hadoop-project-dist/hadoop-common/HttpAuthentication.html"/> <item name="HTTP Authentication" href="hadoop-project-dist/hadoop-common/HttpAuthentication.html"/>