true if the key is deprecated and false otherwise.]]>
null if no such property exists. If the key is deprecated, it returns the value of
the first key which replaces the deprecated key and is not null.
Values are processed for variable expansion before being returned.
@param name the property name; it will be trimmed before the value is looked up.
@return the value of the name or its replacing property,
or null if no such property exists.]]>
name exists without value]]>
String, null if no such property exists.
If the key is deprecated, it returns the value of the first key which
replaces the deprecated key and is not null.
Values are processed for variable expansion before being returned.
@param name the property name.
@return the value of the name or its replacing property,
or null if no such property exists.]]>
String, defaultValue if no such property exists.
See {@link Configuration#getTrimmed} for more details.
@param name the property name.
@param defaultValue the property default value.
@return the value of the name or defaultValue if it is not set.]]>
name property or its replacing property and null if no such property exists.]]>
name property. If name is deprecated or there is a deprecated name associated with it,
it sets the value to both names. The name will be trimmed before it is put into the
configuration.
@param name property name.
@param value property value.]]>
name property. If name is deprecated, it also sets the value to
the keys that replace the deprecated key. The name will be trimmed before it is put
into the configuration.
@param name property name.
@param value property value.
@param source the place that this configuration value came from (for debugging).
@throws IllegalArgumentException when the value or name is null.]]>
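A minimal sketch of the get/set pattern documented above; the property name used here is a hypothetical placeholder, not a real Hadoop key.

import org.apache.hadoop.conf.Configuration;

public class ConfGetSetExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // set() stores the value under the given name (and under any deprecated aliases)
    conf.set("my.example.key", "hello");
    // get() returns the stored value, or null if the key is absent
    String v = conf.get("my.example.key");
    // the two-argument form supplies a default for missing keys
    String w = conf.get("missing.key", "fallback");
    System.out.println(v + " " + w);   // prints "hello fallback"
  }
}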
defaultValue is returned.
@param name property name; it will be trimmed before the value is looked up.
@param defaultValue default value.
@return property value, or defaultValue if the property doesn't exist.]]>
int.
If no such property exists, the provided default value is returned,
or if the specified value is not a valid int, then an error is thrown.
@param name property name.
@param defaultValue default value.
@throws NumberFormatException when the value is invalid
@return property value as an int, or defaultValue.]]>
int values.
If no such property exists, an empty array is returned.
@param name property name
@return property value interpreted as an array of comma-delimited int values]]>
int.
@param name property name.
@param value int value of the property.]]>
long.
If no such property exists, the provided default value is returned,
or if the specified value is not a valid long, then an error is thrown.
@param name property name.
@param defaultValue default value.
@throws NumberFormatException when the value is invalid
@return property value as a long, or defaultValue.]]>
long or human readable format.
If no such property exists, the provided default value is returned,
or if the specified value is not a valid long or human readable format,
then an error is thrown. You can use the following suffixes (case insensitive):
k(kilo), m(mega), g(giga), t(tera), p(peta), e(exa).
@param name property name.
@param defaultValue default value.
@throws NumberFormatException when the value is invalid
@return property value as a long, or defaultValue.]]>
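A short sketch of the human-readable size suffixes described above, assuming the getLongBytes accessor; the property name is a hypothetical placeholder.

import org.apache.hadoop.conf.Configuration;

public class LongBytesExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("my.example.buffer.size", "64m");   // 64 * 1024 * 1024 bytes
    long bytes = conf.getLongBytes("my.example.buffer.size", 4096);
    System.out.println(bytes);                   // prints 67108864
    // a plain number such as "1048576" is also accepted and parsed as-is
  }
}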
long.
@param name property name.
@param value long value of the property.]]>
float.
If no such property exists, the provided default value is returned,
or if the specified value is not a valid float, then an error is thrown.
@param name property name.
@param defaultValue default value.
@throws NumberFormatException when the value is invalid
@return property value as a float, or defaultValue.]]>
float.
@param name property name.
@param value property value.]]>
double.
If no such property exists, the provided default value is returned,
or if the specified value is not a valid double, then an error is thrown.
@param name property name.
@param defaultValue default value.
@throws NumberFormatException when the value is invalid
@return property value as a double, or defaultValue.]]>
double.
@param name property name.
@param value property value.]]>
boolean.
If no such property is specified, or if the specified value is not a valid
boolean, then defaultValue is returned.
@param name property name.
@param defaultValue default value.
@return property value as a boolean, or defaultValue.]]>
boolean.
@param name property name.
@param value boolean value of the property.]]>
set(<name>, value.toString()).
@param name property name
@param value new value]]>
set(<name>, value + <time suffix>).
@param name Property name
@param value Time duration
@param unit Unit of time]]>
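A minimal sketch of storing and reading back a time duration, assuming the setTimeDuration/getTimeDuration pair described here; the property name is a hypothetical placeholder.

import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

public class TimeDurationExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // stored internally as "30s" (the value plus the unit's suffix)
    conf.setTimeDuration("my.example.timeout", 30, TimeUnit.SECONDS);
    // read back in a different unit
    long millis = conf.getTimeDuration("my.example.timeout", 0, TimeUnit.MILLISECONDS);
    System.out.println(millis);   // prints 30000
  }
}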
Pattern.
If no such property is specified, or if the specified value is not a valid
Pattern, then defaultValue is returned.
Note that the returned value is NOT trimmed by this method.
@param name property name
@param defaultValue default value
@return property value as a compiled Pattern, or defaultValue]]>
Strings.
If no such property is specified then an empty collection is returned.
This is an optimized version of {@link #getStrings(String)}.
@param name property name.
@return property value as a collection of Strings.]]>
Strings.
If no such property is specified then null is returned.
@param name property name.
@return property value as an array of Strings, or null.]]>
Strings.
If no such property is specified then the default value is returned.
@param name property name.
@param defaultValue The default value
@return property value as an array of Strings, or the default value.]]>
Strings, trimmed of the leading and trailing whitespace.
If no such property is specified then an empty Collection is returned.
@param name property name.
@return property value as a collection of Strings, or an empty Collection]]>
Strings, trimmed of the leading and trailing whitespace.
If no such property is specified then an empty array is returned.
@param name property name.
@return property value as an array of trimmed Strings, or an empty array.]]>
Strings, trimmed of the leading and trailing whitespace.
If no such property is specified then the default value is returned.
@param name property name.
@param defaultValue The default value
@return property value as an array of trimmed Strings, or the default value.]]>
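A brief sketch contrasting the plain and trimmed string accessors; the property name is a hypothetical placeholder.

import org.apache.hadoop.conf.Configuration;

public class StringsExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("my.example.hosts", " alpha, beta ,gamma ");
    // getStrings() splits on commas and may keep surrounding whitespace
    String[] raw = conf.getStrings("my.example.hosts");
    // getTrimmedStrings() additionally trims each element: {"alpha", "beta", "gamma"}
    String[] trimmed = conf.getTrimmedStrings("my.example.hosts");
    System.out.println(raw.length + " " + trimmed[0]);
  }
}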
InetSocketAddress. If hostProperty is null, addressProperty will be used.
This is useful for cases where we want to differentiate between the host bind
address and the address clients should use to establish a connection.
@param hostProperty bind host property name.
@param addressProperty address property name.
@param defaultAddressValue the default value
@param defaultPort the default port
@return InetSocketAddress]]>
InetSocketAddress.
@param name property name.
@param defaultAddress the default value
@param defaultPort the default port
@return InetSocketAddress]]>
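A minimal sketch of resolving a socket address from the configuration; the property name and host are hypothetical placeholders.

import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;

public class SocketAddrExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("my.example.rpc-address", "node1.example.com:8020");
    // falls back to 0.0.0.0:8020 when the key is unset; the default port is
    // also used when the configured value omits a port
    InetSocketAddress addr =
        conf.getSocketAddr("my.example.rpc-address", "0.0.0.0", 8020);
    System.out.println(addr);
  }
}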
host:port.]]>
host:port. The wildcard address is replaced with the local host's address.
If the host and address properties are configured, the host component of the
address will be combined with the port component of the addr to generate the
address. This is to allow optional control over which host name is used in
multi-home bind-host cases where a host can have multiple names.
@param hostProperty the bind-host configuration name
@param addressProperty the service address configuration name
@param defaultAddressValue the service default address configuration value
@param addr InetSocketAddress of the service listener
@return InetSocketAddress for clients to connect]]>
host:port. The wildcard address is replaced with the local host's address.
@param name property name.
@param addr InetSocketAddress of a listener to store in the given property
@return InetSocketAddress for clients to connect]]>
Class.
The value of the property specifies a list of comma separated class names.
If no such property is specified, then defaultValue is returned.
@param name the property name.
@param defaultValue default value.
@return property value as a Class[], or defaultValue.]]>
Class.
If no such property is specified, then defaultValue is returned.
@param name the class name.
@param defaultValue default value.
@return property value as a Class, or defaultValue.]]>
Class implementing the interface specified by xface.
If no such property is specified, then defaultValue is returned.
An exception is thrown if the returned class does not implement the named
interface.
@param name the class name.
@param defaultValue default value.
@param xface the interface implemented by the named class.
@return property value as a Class, or defaultValue.]]>
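A minimal sketch of registering and looking up a pluggable implementation class; the property name, interface, and implementations below are hypothetical and defined only for this example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ReflectionUtils;

public class GetClassExample {
  public interface Greeter { String greet(); }
  public static class EnglishGreeter implements Greeter {
    public String greet() { return "hello"; }
  }
  public static class FrenchGreeter implements Greeter {
    public String greet() { return "bonjour"; }
  }

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // register the implementation under a property name
    conf.setClass("my.example.greeter.impl", FrenchGreeter.class, Greeter.class);
    // look it up again, falling back to EnglishGreeter if the key is unset;
    // an exception is thrown if the named class does not implement Greeter
    Class<? extends Greeter> cls =
        conf.getClass("my.example.greeter.impl", EnglishGreeter.class, Greeter.class);
    Greeter g = ReflectionUtils.newInstance(cls, conf);
    System.out.println(g.greet());   // prints "bonjour"
  }
}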
List of objects implementing the interface specified by xface.
An exception is thrown if any of the classes does not exist, or if it does
not implement the named interface.
@param name the property name.
@param xface the interface implemented by the classes named by name.
@return a List of objects implementing xface.]]>
theClass implementing the given interface xface.
An exception is thrown if theClass does not implement the interface xface.
@param name property name.
@param theClass property value.
@param xface the interface implemented by the named class.]]>
false to turn it off.]]>
Configurations are specified by resources. A resource contains a set of
name/value pairs as XML data. Each resource is named by either a String
or by a {@link Path}. If named by a String, then the classpath is examined
for a file with that name. If named by a Path, then the local filesystem is
examined directly, without referring to the classpath.
Unless explicitly turned off, Hadoop by default specifies two resources, loaded in-order from the classpath: core-default.xml (read-only defaults for Hadoop) and core-site.xml (site-specific configuration for a given installation).
Configuration parameters may be declared final.
Once a resource declares a value final, no subsequently-loaded
resource can alter that value.
For example, one might define a final parameter with:
<property>
<name>dfs.hosts.include</name>
<value>/etc/hadoop/conf/hosts.include</value>
<final>true</final>
</property>
Administrators typically define parameters as final in
core-site.xml for values that user applications may not alter.
Value strings are first processed for variable expansion. The available properties are other properties defined in this Configuration and, if a name is not found there, the Java System properties (System.getProperties()).
For example, if a configuration resource contains the following property
definitions:
<property>
<name>basedir</name>
<value>/user/${user.name}</value>
</property>
<property>
<name>tempdir</name>
<value>${basedir}/tmp</value>
</property>
When conf.get("tempdir") is called, then ${basedir}
will be resolved to another property in this Configuration, while
${user.name} would then ordinarily be resolved to the value
of the System property with that name.
By default, warnings are logged for any deprecated configuration
parameters; they can be suppressed by configuring
log4j.logger.org.apache.hadoop.conf.Configuration.deprecation in the
log4j.properties file.]]>
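A small sketch of the variable expansion described above, setting the basedir/tempdir properties from the example resource directly instead of loading an XML file.

import org.apache.hadoop.conf.Configuration;

public class ExpansionExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("basedir", "/user/${user.name}");
    conf.set("tempdir", "${basedir}/tmp");
    // ${basedir} is resolved against this Configuration, and ${user.name}
    // is then resolved against the Java System properties
    System.out.println(conf.get("tempdir"));   // e.g. /user/alice/tmp
  }
}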
KeyProvider implementations must be thread safe.]]>
uri is not supported.]]>
EnumSet.of(CreateFlag.CREATE, CreateFlag.APPEND)
Use the CreateFlag as follows:
- CREATE - to create a file if it does not exist,
else throw FileAlreadyExistsException.
- APPEND - to append to a file if it exists,
else throw FileNotFoundException.
- OVERWRITE - to truncate a file if it exists,
else throw FileNotFoundException.
- CREATE|APPEND - to create a file if it does not exist,
else append to an existing file.
- CREATE|OVERWRITE - to create a file if it does not exist,
else overwrite an existing file.
- SYNC_BLOCK - to force closed blocks to the disk device.
In addition {@link Syncable#hsync()} should be called after each write,
if true synchronous behavior is required.
- LAZY_PERSIST - Create the block on transient storage (RAM) if
available.
- APPEND_NEWBLOCK - Append data to a new block instead of end of the last
partial block.
The following combinations are not valid and will result in a
{@link HadoopIllegalArgumentException}:
- APPEND|OVERWRITE
- CREATE|APPEND|OVERWRITE
]]>
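A minimal sketch of the CREATE|APPEND combination with FileContext, assuming a writable local path; the path itself is a hypothetical placeholder.

import java.util.EnumSet;
import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Options.CreateOpts;
import org.apache.hadoop.fs.Path;

public class CreateFlagExample {
  public static void main(String[] args) throws Exception {
    FileContext fc = FileContext.getFileContext();
    Path p = new Path("/tmp/createflag-demo.txt");
    // CREATE|APPEND: create the file if it is absent, otherwise append to it
    FSDataOutputStream out =
        fc.create(p, EnumSet.of(CreateFlag.CREATE, CreateFlag.APPEND),
                  CreateOpts.createParent());
    try {
      out.writeUTF("one more line");
    } finally {
      out.close();
    }
  }
}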
absOrFqPath could not be instantiated.]]>
f is not valid]]>
f already exists
@throws FileNotFoundException If parent of f does not exist and createParent is false
@throws ParentNotDirectoryException If parent of f is not a directory.
@throws UnsupportedFileSystemException If file system for f is not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server
RuntimeExceptions:
@throws InvalidPathException If path f is not valid]]>
dir does not exist and createParent is false
@throws ParentNotDirectoryException If parent of dir is not a directory
@throws UnsupportedFileSystemException If file system for dir is not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server
RuntimeExceptions:
@throws InvalidPathException If path dir is not valid]]>
f is not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server
RuntimeExceptions:
@throws InvalidPathException If path f is invalid]]>
f is not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
f is not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
true if the file has been truncated to the desired newLength
and is immediately available to be reused for write operations such as append,
or false if a background process of adjusting the length of the last block
has been started, and clients should wait for it to complete before proceeding
with further file updates.
@throws AccessControlException If access is denied
@throws FileNotFoundException If file f does not exist
@throws UnsupportedFileSystemException If file system for f is not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
If OVERWRITE option is not passed as an argument, rename fails if the dst already exists.
If OVERWRITE option is passed as an argument, rename overwrites the dst if it is a file or an empty directory. Rename fails if dst is a non-empty directory.
Note that atomicity of rename is dependent on the file system implementation. Please refer to the file system documentation for details.
@param src path to be renamed
@param dst new path after rename
@throws AccessControlException If access is denied
@throws FileAlreadyExistsException If dst already exists and
options has {@link Options.Rename#OVERWRITE} option false.
@throws FileNotFoundException If src does not exist
@throws ParentNotDirectoryException If parent of dst is not a directory
@throws UnsupportedFileSystemException If file system for src and dst is not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
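A minimal sketch of the OVERWRITE semantics described above; the source and destination paths are hypothetical placeholders.

import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Options.Rename;
import org.apache.hadoop.fs.Path;

public class RenameOverwriteExample {
  public static void main(String[] args) throws Exception {
    FileContext fc = FileContext.getFileContext();
    Path src = new Path("/tmp/part-00000.tmp");
    Path dst = new Path("/tmp/part-00000");
    // without Rename.OVERWRITE this call fails if dst already exists;
    // with it, an existing file (or empty directory) at dst is replaced
    fc.rename(src, dst, Rename.OVERWRITE);
  }
}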
f is not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
f is not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server
RuntimeExceptions:
@throws HadoopIllegalArgumentException If username or groupname is invalid.]]>
f is not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
f is not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
f is not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
f is not supported
@throws IOException If an I/O error occurred]]>
f is not supported
@throws IOException If the given path does not refer to a symlink
or an I/O error occurred]]>
f is not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
link already exists
@throws FileNotFoundException If target does not exist
@throws ParentNotDirectoryException If parent of link is not a directory.
@throws UnsupportedFileSystemException If file system for target or link is not supported
@throws IOException If an I/O error occurred]]>
f is not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
f is not supported
@throws IOException If an I/O error occurred
Exceptions applicable to file systems accessed over RPC:
@throws RpcClientException If an exception occurred in the RPC client
@throws RpcServerException If an exception occurred in the RPC server
@throws UnexpectedServerException If server implementation throws
undeclared exception to RPC server]]>
The Hadoop file system supports a URI namespace and URI names. It offers a forest of file systems that can be referenced using fully qualified URIs. Two common Hadoop file system implementations are the local file system and HDFS.
To facilitate this, Hadoop supports a notion of a default file system. The user can set his default file system, although this is typically set up for you in your environment via your default config. A default file system implies a default scheme and authority; slash-relative names (such as /for/bar) are resolved relative to that default FS. Similarly a user can also have working-directory-relative names (i.e. names not starting with a slash). While the working directory is generally in the same default FS, the wd can be in a different FS.
Hence Hadoop path names can be one of:
****The Role of the FileContext and configuration defaults****
The FileContext provides file namespace context for resolving file names; it also contains the umask for permissions. In that sense it is like the per-process file-related state in a Unix system. These two properties, the default file namespace and the umask, come from the default configuration unless they are explicitly overridden.
The file system related server-side defaults are
*** Usage Model for the FileContext class ***
Example 1: use the default config read from the $HADOOP_CONFIG/core.xml. Unspecified values come from core-defaults.xml in the release jar.
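A minimal sketch in the spirit of Example 1, using the default configuration and default file system; the directory path is a hypothetical placeholder.

import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class FileContextExample {
  public static void main(String[] args) throws Exception {
    FileContext fc = FileContext.getFileContext();   // default config, default FS
    // slash-relative name: resolved against the default file system
    fc.mkdir(new Path("/tmp/fc-demo"), FsPermission.getDirDefault(), true);
    // working-directory-relative names are resolved against the working directory
    fc.setWorkingDirectory(new Path("/tmp/fc-demo"));
    System.out.println(fc.getWorkingDirectory());
  }
}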
UnsupportedOperationException.
@return the protocol scheme for the FileSystem.]]>
If OVERWRITE option is not passed as an argument, rename fails if the dst already exists.
If OVERWRITE option is passed as an argument, rename overwrites the dst if it is a file or an empty directory. Rename fails if dst is a non-empty directory.
Note that atomicity of rename is dependent on the file system implementation. Please refer to the file system documentation for details. This default implementation is not atomic.
This method is deprecated since it is a temporary method added to support the transition from FileSystem to FileContext for user applications.
@param src path to be renamed
@param dst new path after rename
@throws IOException on failure]]>
true if the file has been truncated to the desired newLength
and is immediately available to be reused for write operations such as append,
or false if a background process of adjusting the length of the last block
has been started, and clients should wait for it to complete before proceeding
with further file updates.]]>
A filename pattern is composed of regular characters and special pattern matching characters, which are:
The local implementation is {@link LocalFileSystem} and distributed implementation is DistributedFileSystem.]]>
FilterFileSystem itself simply overrides all methods of FileSystem with
versions that pass all requests to the contained file system. Subclasses of
FilterFileSystem may further override some of these methods and may also
provide additional methods and fields.]]>
file]]>
pathname should be included]]>
ftp]]>
viewfs]]>
To use viewfs one would typically set the default file system in the config (i.e. fs.default.name = viewfs:///) along with the mount table config variables as described below.
** Config variables to specify the mount table entries **
The file system is initialized from the standard Hadoop config through config variables. See {@link FsConstants} for URI and Scheme constants; See {@link Constants} for config var constants; see {@link ConfigUtil} for convenient lib.
All the mount table config entries for view fs are prefixed by fs.viewfs.mounttable. For example the above example can be specified with the following config variables:
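The same mount-table entries can also be set programmatically through the ConfigUtil helper, which writes the fs.viewfs.mounttable.* keys into a Configuration; the namenode URIs below are placeholders. A sketch:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsConstants;
import org.apache.hadoop.fs.viewfs.ConfigUtil;

public class ViewFsConfigExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // add links to the default mount table
    ConfigUtil.addLink(conf, "/user", new URI("hdfs://nn1/user"));
    ConfigUtil.addLink(conf, "/data", new URI("hdfs://nn2/data"));
    // make viewfs:/// the default file system
    conf.set("fs.defaultFS", FsConstants.VIEWFS_URI.toString());
    FileSystem viewFs = FileSystem.get(conf);
    System.out.println(viewFs.getUri());
  }
}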
**** Merge Mounts **** (NOTE: merge mounts are not implemented yet.)
One can also use "MergeMounts" to merge several directories (this is sometimes called union-mounts or junction-mounts in the literature). For example, if the home directories are stored on, say, two file systems (because they do not fit on one), then one could specify a mount entry such as the following, which merges two directories:
Fencing is configured by the operator as an ordered list of methods to attempt. Each method will be tried in turn, and the next in the list will only be attempted if the previous one fails. See {@link NodeFencer} for more information.
If an implementation also implements {@link Configurable} then its setConf method will be called upon instantiation.]]>
Compared with ObjectWritable, this class is much more efficient,
because ObjectWritable will append the class declaration as a String
into the output file in every Key-Value pair.
Generic Writable implements {@link Configurable} interface, so that it will be configured by the framework. The configuration is passed to the wrapped objects implementing {@link Configurable} interface before deserialization.
How to use it: the getTypes() method defines the classes which will be
wrapped in GenericObject in the application. Attention: the classes defined
in the getTypes() method must implement the Writable interface.
@since Nov 8, 2006]]>
public class GenericObject extends GenericWritable {

  private static Class[] CLASSES = {
    ClassType1.class,
    ClassType2.class,
    ClassType3.class,
  };

  protected Class[] getTypes() {
    return CLASSES;
  }
}
data file, containing all keys and values in the map, and a smaller index
file, containing a fraction of the keys. The fraction is determined by
{@link Writer#getIndexInterval()}.
The index file is read entirely into memory. Thus key implementations should try to keep themselves small.
Map files are created by adding entries in-order. To maintain a large database, perform updates by copying the previous version of a database and merging in a sorted change list, to create a new version of the database in a new file. Sorting large change lists can be done with {@link SequenceFile.Sorter}.]]>
SequenceFile provides {@link SequenceFile.Writer},
{@link SequenceFile.Reader} and {@link Sorter} classes for writing,
reading and sorting respectively.
There are three SequenceFile Writers based on the
{@link CompressionType} used to compress key/value pairs:
Writer: Uncompressed records.
RecordCompressWriter: Record-compressed files, only compress values.
BlockCompressWriter: Block-compressed files, both keys & values are
collected in 'blocks' separately and compressed. The size of the 'block'
is configurable.
The actual compression algorithm used to compress key and/or values can be
specified by using the appropriate {@link CompressionCodec}.
The recommended way is to use the static createWriter methods provided by
SequenceFile to choose the preferred format.
The {@link SequenceFile.Reader} acts as the bridge and can read any of the
above SequenceFile formats.
Essentially there are 3 different formats for SequenceFiles depending on the
CompressionType specified. All of them share a common header described below.
CompressionCodec class which is used for compression of keys and/or values
(if compression is enabled).
A sync marker every few 100 bytes or so.
A sync marker every few 100 bytes or so.
The compressed blocks of key lengths and value lengths consist of the actual lengths of individual keys/values encoded in ZeroCompressedInteger format.
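A short sketch of the createWriter/Reader usage recommended above, writing a block-compressed file and reading it back; the file path is a hypothetical placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class SequenceFileExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path file = new Path("/tmp/demo.seq");

    // write a block-compressed file with the default codec
    SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(file),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(IntWritable.class),
        SequenceFile.Writer.compression(CompressionType.BLOCK));
    try {
      writer.append(new Text("apples"), new IntWritable(3));
      writer.append(new Text("pears"), new IntWritable(7));
    } finally {
      writer.close();
    }

    // the Reader detects the format and codec from the file header
    SequenceFile.Reader reader =
        new SequenceFile.Reader(conf, SequenceFile.Reader.file(file));
    try {
      Text key = new Text();
      IntWritable value = new IntWritable();
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    } finally {
      reader.close();
    }
  }
}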
@see CompressionCodec]]>
start. The starting position is measured in bytes and the return value is in
terms of byte position in the buffer. The backing buffer is not converted to a
string for this operation.
@return byte position of the first occurrence of the search string in the
UTF-8 buffer or -1 if not found]]>
new byte[0]).]]>
Also includes utilities for serializing/deserializing a string, coding/decoding a string, checking if a byte array contains valid UTF-8 code, and calculating the length of an encoded string.]]>
DataOutput to serialize this object into.
@throws IOException]]>
For efficiency, implementations should attempt to re-use storage in the existing object where possible.
@param in DataInput to deserialize this object from.
@throws IOException]]>
key or value type in the Hadoop Map-Reduce framework implements this
interface.
Implementations typically implement a static read(DataInput) method which
constructs a new instance, calls {@link #readFields(DataInput)} and returns
the instance.
Example:
]]>
public class MyWritable implements Writable {
  // Some data
  private int counter;
  private long timestamp;

  public void write(DataOutput out) throws IOException {
    out.writeInt(counter);
    out.writeLong(timestamp);
  }

  public void readFields(DataInput in) throws IOException {
    counter = in.readInt();
    timestamp = in.readLong();
  }

  public static MyWritable read(DataInput in) throws IOException {
    MyWritable w = new MyWritable();
    w.readFields(in);
    return w;
  }
}
WritableComparables can be compared to each other, typically via
Comparators. Any type which is to be used as a key in the Hadoop Map-Reduce
framework should implement this interface.
Note that hashCode() is frequently used in Hadoop to partition keys. It's
important that your implementation of hashCode() returns the same result
across different instances of the JVM. Note also that the default hashCode()
implementation in Object does not satisfy this property.
Example:
]]>
public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
  // Some data
  private int counter;
  private long timestamp;

  public void write(DataOutput out) throws IOException {
    out.writeInt(counter);
    out.writeLong(timestamp);
  }

  public void readFields(DataInput in) throws IOException {
    counter = in.readInt();
    timestamp = in.readLong();
  }

  public int compareTo(MyWritableComparable o) {
    int thisValue = this.counter;
    int thatValue = o.counter;
    return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
  }

  public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + counter;
    result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
    return result;
  }
}
One may optimize compare-intensive operations by overriding {@link #compare(byte[],int,int,byte[],int,int)}. Static utility methods are provided to assist in optimized implementations of this method.]]>
Compressor.
@param conf the Configuration object which contains settings for creating
or reinitializing the compressor
@return Compressor for the given CompressionCodec from the pool or a new one]]>
Decompressor.
@return Decompressor for the given CompressionCodec from the pool or a new one]]>
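A brief sketch of borrowing a compressor from the pool and returning it; the output path is a hypothetical placeholder and GzipCodec stands in for whatever codec is in use.

import java.io.FileOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CodecPoolExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
    // borrow a (possibly reinitialized) compressor from the pool
    Compressor compressor = CodecPool.getCompressor(codec, conf);
    try {
      CompressionOutputStream out =
          codec.createOutputStream(new FileOutputStream("/tmp/demo.gz"), compressor);
      out.write("hello".getBytes("UTF-8"));
      out.close();
    } finally {
      // always return the compressor so it can be reused
      CodecPool.returnCompressor(compressor);
    }
  }
}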
b[] remain unmodified until the caller is explicitly notified--via
{@link #needsInput()}--that the buffer may be safely modified. With this
requirement, an extra buffer-copy can be avoided.)
@param b Input data
@param off Start offset
@param len Length]]>
true if the input data buffer is empty and
{@link #setInput(byte[], int, int)} should be called in order to provide more
input.]]>
true if a preset dictionary is needed for decompression]]>
true and {@link #getRemaining()} returns a positive value. finished() will
be reset with the {@link #reset()} method.
@return true if the end of the decompressed data output stream has been
reached.]]>
true and getRemaining() returns a zero value, it indicates that the end of
the data stream has been reached and it is not a concatenated data stream.
@return The number of bytes remaining in the compressed data buffer.]]>
false when reset() is called.]]>
The behavior of TFile can be customized by the following variables through Configuration:
Suggestions on performance optimization.
To add a new serialization framework write an implementation of {@link org.apache.hadoop.io.serializer.Serialization} and add its name to the "io.serializations" property.
]]>Use {@link org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization} for serialization of classes generated by Avro's 'specific' compiler.
Use {@link org.apache.hadoop.io.serializer.avro.AvroReflectSerialization} for other classes. {@link org.apache.hadoop.io.serializer.avro.AvroReflectSerialization} work for any class which is either in the package list configured via {@link org.apache.hadoop.io.serializer.avro.AvroReflectSerialization#AVRO_REFLECT_PACKAGES} or implement {@link org.apache.hadoop.io.serializer.avro.AvroReflectSerializable} interface.
]]>org.apache.hadoop.metrics.spi
org.apache.hadoop.metrics.file
org.apache.hadoop.metrics.ganglia
private ContextFactory contextFactory = ContextFactory.getFactory();

void reportMyMetric(float myMetric) {
  MetricsContext myContext = contextFactory.getContext("myContext");
  MetricsRecord myRecord = myContext.getRecord("myRecord");
  myRecord.setMetric("myMetric", myMetric);
  myRecord.update();
}

In this example there are three names:
private MetricsRecord diskStats =
    contextFactory.getContext("myContext").getRecord("diskStats");

void reportDiskMetrics(String diskName, float diskBusy, float diskUsed) {
  diskStats.setTag("diskName", diskName);
  diskStats.setMetric("diskBusy", diskBusy);
  diskStats.setMetric("diskUsed", diskUsed);
  diskStats.update();
}
MetricsRecord.update() is called. Instead it is stored in an internal table,
and the contents of the table are sent periodically. This can be important
for two reasons:
registerUpdater() method. The benefit of this versus using java.util.Timer
is that the callbacks will be done immediately before sending the data,
making the data as current as possible.
ContextFactory factory = ContextFactory.getFactory();
... examine and/or modify factory attributes ...
MetricsContext context = factory.getContext("myContext");

The factory attributes can be examined and modified using the following
ContextFactory methods:
Object getAttribute(String attributeName)
String[] getAttributeNames()
void setAttribute(String name, Object value)
void removeAttribute(attributeName)
ContextFactory.getFactory() initializes the factory attributes by reading the
properties file hadoop-metrics.properties if it exists on the class path.
A factory attribute named:
contextName.class
should have as its value the fully qualified name of the class to be
instantiated by a call of the ContextFactory method getContext(contextName).
If this factory attribute is not specified, the default is to instantiate
org.apache.hadoop.metrics.file.FileContext.
Other factory attributes are specific to a particular implementation of this
API and are documented elsewhere. For example, configuration attributes for
the file and Ganglia implementations can be found in the javadoc for
their respective packages.]]>
recordName is not in that set.
@param recordName the name of the record
@throws MetricsException if recordName conflicts with configuration data]]>
emitRecord method in order to transmit the data.]]>
remove().]]>
org.apache.hadoop.metrics.ganglia.
Plugging in an implementation involves writing a concrete subclass of
AbstractMetricsContext. The subclass should get its configuration information
using the getAttribute(attributeName) method.]]>
Configured base class, and should not be changed to do so, as it causes
problems for subclasses. The constructor of the Configured calls the
{@link #setConf(Configuration)} method, which will call into the subclasses
before they have been fully constructed.]]>
RawScriptBasedMapping that performs the work: reading the configuration
parameters, executing any defined script, handling errors and such like. The
outer class extends {@link CachedDNSToSwitchMapping} to cache the delegated
queries.
This DNS mapper's {@link #isSingleSwitch()} predicate returns true if and
only if a script is defined.]]>
This class uses the configuration parameter {@code net.topology.table.file.name} to locate the mapping file.
Calls to {@link #resolve(List)} will look up the address as defined in the mapping file. If no entry corresponding to the address is found, the value {@code /default-rack} is returned.
]]>DEPRECATED: Replaced by Avro.
recfile = *include module *record
include = "include" path
path = (relative-path / absolute-path)
module = "module" module-name
module-name = name *("." name)
record := "class" name "{" 1*(field) "}"
field := type name ";"
name := ALPHA (ALPHA / DIGIT / "_" )*
type := (ptype / ctype)
ptype := ("byte" / "boolean" / "int" |
"long" / "float" / "double"
"ustring" / "buffer")
ctype := (("vector" "<" type ">") /
("map" "<" type "," type ">" ) ) / name)
A DDL file describes one or more record types. It begins with zero or
more include declarations, a single mandatory module declaration
followed by zero or more class declarations. The semantics of each of
these declarations are described below:
module links {
class Link {
ustring URL;
boolean isRelative;
ustring anchorText;
};
}
include "links.jr"
module outlinks {
class OutLinks {
ustring baseURL;
vector outLinks;
};
}
$ rcc -l C++ ...
namespace hadoop {
enum RecFormat { kBinary, kXML, kCSV };
class InStream {
public:
virtual ssize_t read(void *buf, size_t n) = 0;
};
class OutStream {
public:
virtual ssize_t write(const void *buf, size_t n) = 0;
};
class IOError : public runtime_error {
public:
explicit IOError(const std::string& msg);
};
class IArchive;
class OArchive;
class RecordReader {
public:
RecordReader(InStream& in, RecFormat fmt);
virtual ~RecordReader(void);
virtual void read(Record& rec);
};
class RecordWriter {
public:
RecordWriter(OutStream& out, RecFormat fmt);
virtual ~RecordWriter(void);
virtual void write(Record& rec);
};
class Record {
public:
virtual std::string type(void) const = 0;
virtual std::string signature(void) const = 0;
protected:
virtual bool validate(void) const = 0;
virtual void
serialize(OArchive& oa, const std::string& tag) const = 0;
virtual void
deserialize(IArchive& ia, const std::string& tag) = 0;
};
}
namespace links {
class Link : public hadoop::Record {
// ....
};
};
Each field within the record will cause the generation of a private member
declaration of the appropriate type in the class declaration, and one or more
accessor methods. The generated class will implement the serialize and
deserialize methods defined in hadoop::Record. It will also
implement the inspection methods type and signature from
hadoop::Record. A default constructor and virtual destructor will also
be generated. Serialization code will read/write records into streams that
implement the hadoop::InStream and the hadoop::OutStream interfaces.
For each member of a record an accessor method is generated that returns
either the member or a reference to the member. For members that are returned
by value, a setter method is also generated. This is true for primitive
data members of the types byte, int, long, boolean, float and
double. For example, for an int field called MyField the following
code is generated.
...
private:
int32_t mMyField;
...
public:
int32_t getMyField(void) const {
return mMyField;
};
void setMyField(int32_t m) {
mMyField = m;
};
...
For a ustring or buffer or composite field, the generated code contains only
accessors that return a reference to the field. A const and a non-const
accessor are generated. For example:
...
private:
std::string mMyBuf;
...
public:
std::string& getMyBuf() {
return mMyBuf;
};
const std::string& getMyBuf() const {
return mMyBuf;
};
...
module inclrec {
class RI {
int I32;
double D;
ustring S;
};
}
and the testrec.jr file contains:
include "inclrec.jr"
module testrec {
class R {
vector VF;
RI Rec;
buffer Buf;
};
}
Then the invocation of rcc such as:
$ rcc -l c++ inclrec.jr testrec.jr
will result in generation of four files:
inclrec.jr.{cc,hh} and testrec.jr.{cc,hh}.
The inclrec.jr.hh will contain:
#ifndef _INCLREC_JR_HH_
#define _INCLREC_JR_HH_
#include "recordio.hh"
namespace inclrec {
class RI : public hadoop::Record {
private:
int32_t I32;
double D;
std::string S;
public:
RI(void);
virtual ~RI(void);
virtual bool operator==(const RI& peer) const;
virtual bool operator<(const RI& peer) const;
virtual int32_t getI32(void) const { return I32; }
virtual void setI32(int32_t v) { I32 = v; }
virtual double getD(void) const { return D; }
virtual void setD(double v) { D = v; }
virtual std::string& getS(void) { return S; }
virtual const std::string& getS(void) const { return S; }
virtual std::string type(void) const;
virtual std::string signature(void) const;
protected:
virtual void serialize(hadoop::OArchive& a) const;
virtual void deserialize(hadoop::IArchive& a);
};
} // end namespace inclrec
#endif /* _INCLREC_JR_HH_ */
The testrec.jr.hh file will contain:
#ifndef _TESTREC_JR_HH_
#define _TESTREC_JR_HH_
#include "inclrec.jr.hh"
namespace testrec {
class R : public hadoop::Record {
private:
std::vector VF;
inclrec::RI Rec;
std::string Buf;
public:
R(void);
virtual ~R(void);
virtual bool operator==(const R& peer) const;
virtual bool operator<(const R& peer) const;
virtual std::vector& getVF(void);
virtual const std::vector& getVF(void) const;
virtual std::string& getBuf(void);
virtual const std::string& getBuf(void) const;
virtual inclrec::RI& getRec(void);
virtual const inclrec::RI& getRec(void) const;
virtual bool serialize(hadoop::OutArchive& a) const;
virtual bool deserialize(hadoop::InArchive& a);
virtual std::string type(void) const;
virtual std::string signature(void) const;
};
}; // end namespace testrec
#endif /* _TESTREC_JR_HH_ */
DDL Type C++ Type Java Type
boolean bool boolean
byte int8_t byte
int int32_t int
long int64_t long
float float float
double double double
ustring std::string java.lang.String
buffer std::string org.apache.hadoop.record.Buffer
class type class type class type
vector std::vector java.util.ArrayList
map std::map java.util.TreeMap
record = primitive / struct / vector / map
primitive = boolean / int / long / float / double / ustring / buffer
boolean = "T" / "F"
int = ["-"] 1*DIGIT
long = ";" ["-"] 1*DIGIT
float = ["-"] 1*DIGIT "." 1*DIGIT ["E" / "e" ["-"] 1*DIGIT]
double = ";" ["-"] 1*DIGIT "." 1*DIGIT ["E" / "e" ["-"] 1*DIGIT]
ustring = "'" *(UTF8 char except NULL, LF, % and , / "%00" / "%0a" / "%25" / "%2c" )
buffer = "#" *(BYTE except NULL, LF, % and , / "%00" / "%0a" / "%25" / "%2c" )
struct = "s{" record *("," record) "}"
vector = "v{" [record *("," record)] "}"
map = "m{" [*(record "," record)] "}"
class {
int MY_INT; // value 5
vector MY_VEC; // values 0.1, -0.89, 2.45e4
buffer MY_BUF; // value '\00\n\tabc%'
}
is serialized as
<value>
<struct>
<member>
<name>MY_INT</name>
<value><i4>5</i4></value>
</member>
<member>
<name>MY_VEC</name>
<value>
<array>
<data>
<value><ex:float>0.1</ex:float></value>
<value><ex:float>-0.89</ex:float></value>
<value><ex:float>2.45e4</ex:float></value>
</data>
</array>
</value>
</member>
<member>
<name>MY_BUF</name>
<value><string>%00\n\tabc%25</string></value>
</member>
</struct>
</value>
]]>
DEPRECATED: Replaced by Avro.
]]> The task requires the file or the nested fileset element to be
specified. Optional attributes are language (set the output language,
default is "java"), destdir (name of the destination directory for generated
java/c++ code, default is ".") and failonerror (specifies error handling
behavior, default is true).
<recordcc destdir="${basedir}/gensrc" language="java">
  <fileset include="**/*.jr" />
</recordcc>

@deprecated Replaced by Avro.]]>
DEPRECATED: Replaced by Avro.
]]>
null, the default one will be used.]]>
null, the default one will be used.
@param connConfigurator a connection configurator.]]>
TRUE if the token is transmitted in the URL query string, FALSE if the
delegation token is transmitted using the
{@link DelegationTokenAuthenticator#DELEGATION_TOKEN_HEADER} HTTP header.]]>
FALSE if the delegation token is transmitted using the
{@link DelegationTokenAuthenticator#DELEGATION_TOKEN_HEADER} HTTP header.]]>
doAs parameter is not NULL, the request will be done on behalf of the
specified doAs user.
@param url the URL to connect to. Only HTTP/S URLs are supported.
@param token the authentication token being used for the user.
@param doAs user to do the request on behalf of; if NULL the request is as self.
@return an authenticated {@link HttpURLConnection}.
@throws IOException if an IO error occurred.
@throws AuthenticationException if an authentication exception occurred.]]>
AuthenticatedURL instances are not thread-safe.]]>
Progressable to explicitly report progress to the Hadoop framework. This is
especially important for operations which take a significant amount of time,
since, in lieu of the reported progress, the framework has to assume that an
error has occurred and time out the operation.]]>
Class of the given object.]]>
Tool, is the standard for any Map-Reduce tool/application. The
tool/application should delegate the handling of standard command-line
options to {@link ToolRunner#run(Tool, String[])} and only handle its custom
arguments.
Here is how a typical Tool is implemented:
@see GenericOptionsParser
@see ToolRunner]]>
public class MyApp extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    // Configuration processed by ToolRunner
    Configuration conf = getConf();

    // Create a JobConf using the processed conf
    JobConf job = new JobConf(conf, MyApp.class);

    // Process custom command-line options
    Path in = new Path(args[1]);
    Path out = new Path(args[2]);

    // Specify various job-specific parameters
    job.setJobName("my-app");
    job.setInputPath(in);
    job.setOutputPath(out);
    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);

    // Submit the job, then poll for progress until the job is complete
    JobClient.runJob(job);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // Let ToolRunner handle generic command-line options
    int res = ToolRunner.run(new Configuration(), new MyApp(), args);
    System.exit(res);
  }
}
Configuration, or builds one if null.
Sets the Tool's configuration with the possibly modified version of the conf.
@param conf Configuration for the Tool.
@param tool Tool to run.
@param args command-line arguments to the tool.
@return exit code of the {@link Tool#run(String[])} method.]]>
Configuration.
Equivalent to run(tool.getConf(), tool, args).
@param tool Tool to run.
@param args command-line arguments to the tool.
@return exit code of the {@link Tool#run(String[])} method.]]>
ToolRunner can be used to run classes implementing the Tool interface. It
works in conjunction with {@link GenericOptionsParser} to parse the generic
hadoop command line arguments and modifies the Configuration of the Tool.
The application-specific options are passed along without being modified.
@see Tool
@see GenericOptionsParser]]>
The Bloom filter is a data structure that was introduced in 1970 and that has been adopted by the networking research community in the past decade thanks to the bandwidth efficiencies that it offers for the transmission of set membership information between networked hosts. A sender encodes the information into a bit vector, the Bloom filter, that is more compact than a conventional representation. Computation and space costs for construction are linear in the number of elements. The receiver uses the filter to test whether various elements are members of the set. Though the filter will occasionally return a false positive, it will never return a false negative. When creating the filter, the sender can choose its desired point in a trade-off between the false positive rate and the size.
Originally created by European Commission One-Lab Project 034819. @see Filter The general behavior of a filter @see Space/Time Trade-Offs in Hash Coding with Allowable Errors]]>
Invariant: nothing happens if the specified key does not belong to this counter Bloom filter. @param key The key to remove.]]>
NOTE: due to the bucket size of this filter, inserting the same key more
than 15 times will cause an overflow at all filter positions associated with
this key, and it will significantly increase the error rate for this and
other keys. For this reason the filter can only be used to store small count
values 0 <= N << 15.
@param key key to be tested
@return 0 if the key is not present. Otherwise, a positive value v will be
returned such that v == count with probability equal to the error rate of
this filter, and v > count otherwise. Additionally, if the filter experienced
an underflow as a result of a {@link #delete(Key)} operation, the return
value may be lower than the count with the probability of the false negative
rate of such a filter.]]>
A counting Bloom filter is an improvement to standard a Bloom filter as it allows dynamic additions and deletions of set membership information. This is achieved through the use of a counting vector instead of a bit vector.
Originally created by European Commission One-Lab Project 034819. @see Filter The general behavior of a filter @see Summary cache: a scalable wide-area web cache sharing protocol]]>
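A brief sketch of adding, counting, and deleting a key with a counting Bloom filter; the vector size and hash count are illustrative values, not tuned recommendations.

import org.apache.hadoop.util.bloom.CountingBloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class CountingBloomFilterExample {
  public static void main(String[] args) throws Exception {
    CountingBloomFilter filter = new CountingBloomFilter(1024, 4, Hash.MURMUR_HASH);
    Key key = new Key("user-42".getBytes("UTF-8"));

    filter.add(key);
    filter.add(key);
    System.out.println(filter.membershipTest(key));    // true
    System.out.println(filter.approximateCount(key));  // roughly 2, subject to the error rate

    filter.delete(key);                                // decrements the counters
    filter.delete(key);
    System.out.println(filter.membershipTest(key));    // false, barring collisions with other keys
  }
}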
A dynamic Bloom filter (DBF) makes use of a s * m bit matrix but each of
the s rows is a standard Bloom filter. The creation process of a DBF is
iterative. At the start, the DBF is a 1 * m bit matrix, i.e., it is composed
of a single standard Bloom filter. It assumes that nr elements are recorded
in the initial bit vector, where nr <= n (n is the cardinality of the set A
to record in the filter).
As the size of A grows during the execution of the application, several keys
must be inserted in the DBF. When inserting a key into the DBF, one must
first get an active Bloom filter in the matrix. A Bloom filter is active
when the number of recorded keys, nr, is strictly less than the current
cardinality of A, n. If an active Bloom filter is found, the key is inserted
and nr is incremented by one. On the other hand, if there is no active Bloom
filter, a new one is created (i.e., a new row is added to the matrix)
according to the current size of A and the element is added in this new
Bloom filter and the nr value of this new Bloom filter is set to one. A
given key is said to belong to the DBF if the k positions are set to one in
one of the matrix rows.
Originally created by European Commission One-Lab Project 034819. @see Filter The general behavior of a filter @see BloomFilter A Bloom filter @see Theory and Network Applications of Dynamic Bloom Filters]]>
Invariant: if the false positive is null, nothing happens.
@param key The false positive key to add.]]>
It allows the removal of selected false positives at the cost of introducing random false negatives, and with the benefit of eliminating some random false positives at the same time.
Originally created by European Commission One-Lab Project 034819. @see Filter The general behavior of a filter @see BloomFilter A Bloom filter @see RemoveScheme The different selective clearing algorithms @see Retouched Bloom Filters: Allowing Networked Applications to Trade Off Selected False Positives Against False Negatives]]>