Get the value of the name property, or null
if no such property exists.
Values are processed for variable expansion
before being returned.
@param name the property name.
@return the value of the name property,
or null if no such property exists.]]>
The value of the name property,
or null if no such property exists.]]>
Set the value of the name property.
@param name property name.
@param value property value.]]>
Get the value of the name property. If no such property exists, then
defaultValue is returned.
@param name property name.
@param defaultValue default value.
@return property value, or defaultValue
if the property doesn't exist.]]>
Get the value of the name property as an int.
If no such property exists, or if the specified value is not a valid
int, then defaultValue is returned.
@param name property name.
@param defaultValue default value.
@return property value as an int, or defaultValue.]]>
Set the value of the name property to an int.
@param name property name.
@param value int value of the property.]]>
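As a brief, hedged sketch of how these typed accessors behave (the property
names "my.app.retries" and "my.app.unset" are hypothetical, not part of the
API):
import org.apache.hadoop.conf.Configuration;

public class ConfIntExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setInt("my.app.retries", 3);                // stored as the string "3"
    int retries = conf.getInt("my.app.retries", 1);  // parsed back as an int: 3
    int missing = conf.getInt("my.app.unset", 7);    // no such property: returns 7
    System.out.println(retries + " " + missing);     // prints "3 7"
  }
}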
Get the value of the name property as a long.
If no such property is specified, or if the specified value is not a valid
long, then defaultValue is returned.
@param name property name.
@param defaultValue default value.
@return property value as a long, or defaultValue.]]>
Set the value of the name property to a long.
@param name property name.
@param value long value of the property.]]>
Get the value of the name property as a float.
If no such property is specified, or if the specified value is not a valid
float, then defaultValue is returned.
@param name property name.
@param defaultValue default value.
@return property value as a float, or defaultValue.]]>
Set the value of the name property to a float.
@param name property name.
@param value property value.]]>
Get the value of the name property as a boolean.
If no such property is specified, or if the specified value is not a valid
boolean, then defaultValue is returned.
@param name property name.
@param defaultValue default value.
@return property value as a boolean, or defaultValue.]]>
Set the value of the name property to a boolean.
@param name property name.
@param value boolean value of the property.]]>
Get the comma-delimited values of the name property as a collection of Strings.
If no such property is specified then an empty collection is returned.
This is an optimized version of {@link #getStrings(String)}.
@param name property name.
@return property value as a collection of Strings.]]>
Get the comma-delimited values of the name property as an array of Strings.
If no such property is specified then null is returned.
@param name property name.
@return property value as an array of Strings, or null.]]>
Get the comma-delimited values of the name property as an array of Strings.
If no such property is specified then the default value is returned.
@param name property name.
@param defaultValue The default value
@return property value as an array of Strings, or the default value.]]>
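A small, hedged illustration of the comma-delimited behavior described above
(the property name "my.app.hosts" is hypothetical):
import org.apache.hadoop.conf.Configuration;

public class ConfStringsExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("my.app.hosts", "a.example.com,b.example.com");
    String[] hosts = conf.getStrings("my.app.hosts"); // two entries
    String[] unset = conf.getStrings("my.app.unset"); // null: no such property
    System.out.println(hosts.length + " " + (unset == null)); // prints "2 true"
  }
}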
Get the value of the name property as an array of Classes.
The value of the property specifies a list of comma separated class names.
If no such property is specified, then defaultValue is returned.
@param name the property name.
@param defaultValue default value.
@return property value as a Class[], or defaultValue.]]>
Get the value of the name property as a Class.
If no such property is specified, then defaultValue is returned.
@param name the class name.
@param defaultValue default value.
@return property value as a Class, or defaultValue.]]>
Get the value of the name property as a Class
implementing the interface specified by xface.
If no such property is specified, then defaultValue is returned.
An exception is thrown if the returned class does not implement the named
interface.
@param name the class name.
@param defaultValue default value.
@param xface the interface implemented by the named class.
@return property value as a Class, or defaultValue.]]>
Set the value of the name property to the name of
theClass implementing the given interface xface.
An exception is thrown if theClass does not implement the
interface xface.
@param name property name.
@param theClass property value.
@param xface the interface implemented by the named class.]]>
false to turn it off.]]>
Configurations are specified by resources. A resource contains a set of
name/value pairs as XML data. Each resource is named by either a
String or by a {@link Path}. If named by a String,
then the classpath is examined for a file with that name. If named by a
Path, then the local filesystem is examined directly, without
referring to the classpath.
Unless explicitly turned off, Hadoop by default specifies two resources,
loaded in-order from the classpath: core-default.xml (read-only defaults)
and core-site.xml (site-specific configuration).
Configuration parameters may be declared final.
Once a resource declares a value final, no subsequently-loaded
resource can alter that value.
For example, one might define a final parameter with:
<property>
<name>dfs.client.buffer.dir</name>
<value>/tmp/hadoop/dfs/client</value>
<final>true</final>
</property>
Administrators typically define parameters as final in
core-site.xml for values that user applications may not alter.
Value strings are first processed for variable expansion. The available
properties are: other properties defined in this Configuration, and, if a
name is undefined here, properties defined in the Java System properties.
For example, if a configuration resource contains the following property
definitions:
<property>
<name>basedir</name>
<value>/user/${user.name}</value>
</property>
<property>
<name>tempdir</name>
<value>${basedir}/tmp</value>
</property>
When conf.get("tempdir") is called, then ${basedir}
will be resolved to another property in this Configuration, while
${user.name} would then ordinarily be resolved to the value
of the System property with that name.]]>
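The same resolution can be observed programmatically; a minimal sketch,
setting the properties in code rather than in a resource file:
import org.apache.hadoop.conf.Configuration;

public class ExpansionExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("basedir", "/user/${user.name}");
    conf.set("tempdir", "${basedir}/tmp");
    // ${basedir} resolves to a property of this Configuration, and
    // ${user.name} then falls through to the System property of that name.
    System.out.println(conf.get("tempdir")); // e.g. /user/alice/tmp
  }
}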
DistributedCache
is a facility provided by the Map-Reduce
framework to cache files (text, archives, jars etc.) needed by applications.
Applications specify the files to be cached via urls (hdfs:// or http://)
in the {@link org.apache.hadoop.mapred.JobConf}.
The DistributedCache
assumes that the
files specified via hdfs:// urls are already present on the
{@link FileSystem} at the path specified by the url.
The framework will copy the necessary files on to the slave node before any tasks for the job are executed on that node. Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves.
DistributedCache
can be used to distribute simple, read-only
data/text files and/or more complex types such as archives, jars etc.
Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes.
Jars may be optionally added to the classpath of the tasks, a rudimentary
software distribution mechanism. Files have execution permissions.
Optionally users can also direct it to symlink the distributed cache file(s)
into the working directory of the task.
DistributedCache
tracks modification timestamps of the cache
files. Clearly the cache files should not be modified by the application
or externally while the job is executing.
Here is an illustrative example on how to use the
DistributedCache:
// Setting up the cache for the application
1. Copy the requisite files to the FileSystem:
$ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat
$ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip
$ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
$ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
$ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
$ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz
2. Setup the application's JobConf:
JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);
DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);
3. Use the cached files in the {@link org.apache.hadoop.mapred.Mapper}
or {@link org.apache.hadoop.mapred.Reducer}:
public static class MapClass extends MapReduceBase
implements Mapper<K, V, K, V> {

  private Path[] localArchives;
  private Path[] localFiles;

  public void configure(JobConf job) {
    // Get the cached archives/files
    localArchives = DistributedCache.getLocalCacheArchives(job);
    localFiles = DistributedCache.getLocalCacheFiles(job);
  }

  public void map(K key, V value,
                  OutputCollector<K, V> output, Reporter reporter)
      throws IOException {
    // Use data from the cached archives/files here
    // ...
    // ...
    output.collect(k, v);
  }
}
@see org.apache.hadoop.mapred.JobConf
@see org.apache.hadoop.mapred.JobClient]]>
Creates an input stream and saves its argument, the input stream
in, for later use. An internal
buffer array of length size is created and stored in buf.
@param in the underlying input stream.
@param size the buffer size.
@exception IllegalArgumentException if size <= 0.]]>
A filename pattern is composed of regular characters and special pattern matching characters, which are:
The local implementation is {@link LocalFileSystem} and the distributed
implementation is DistributedFileSystem.]]>
FilterFileSystem
itself simply overrides all methods of
FileSystem
with versions that
pass all requests to the contained file
system. Subclasses of FilterFileSystem
may further override some of these methods
and may also provide additional methods
and fields.]]>
Reads in next checksum chunk data into buf at offset
and checksum into checksum.
The method is used for implementing read, therefore, it should be optimized
for sequential reading.
@param pos chunkPos
@param buf destination buffer
@param offset offset in buf at which to store data
@param len maximum number of bytes to read
@return number of bytes read]]>
This method implements the general contract of the corresponding
{@link InputStream#read(byte[], int, int) read} method of
the {@link InputStream} class. As an additional
convenience, it attempts to read as many bytes as possible by repeatedly
invoking the read method of the underlying stream. This
iterated read continues until one of the following
conditions becomes true: the specified number of bytes have been read, or
the read method of the underlying stream returns
-1, indicating end-of-file. If the first
read on the underlying stream returns
-1 to indicate end-of-file then this method returns
-1. Otherwise this method returns the number of bytes
actually read.
@param b destination buffer.
@param off offset at which to start storing bytes.
@param len maximum number of bytes to read.
@return the number of bytes read, or -1
if the end of
the stream has been reached.
@exception IOException if an I/O error occurs.
@exception ChecksumException if any checksum error occurs]]>
This method may skip more bytes than are remaining in the backing file. This produces no exception and the number of bytes skipped may include some number of bytes that were beyond the EOF of the backing file. Attempting to read from the stream after skipping past the end will result in -1 indicating the end of the file.
If n
is negative, no bytes are skipped.
@param n the number of bytes to be skipped.
@return the actual number of bytes skipped.
@exception IOException if an I/O error occurs.
@exception ChecksumException if the chunk to skip to is corrupted]]>
Reads bytes in a loop from the input stream stm.
@param stm an input stream
@param buf destination buffer
@param offset offset at which to store data
@param len number of bytes to read
@return actual number of bytes read
@throws IOException if there is any IO error]]>
Writes len bytes from the specified byte array starting at offset
off and generates a checksum for
each data chunk.
This method stores bytes from the given array into this stream's buffer
before it gets checksummed. The buffer gets checksummed and flushed to the
underlying output stream when all data in a checksum chunk are in the buffer.
If the buffer is empty and the requested length is at least as large as the
size of the next checksum chunk, this method will checksum and write the
chunk directly to the underlying output stream, thus avoiding an unnecessary
data copy.
@param b the data.
@param off the start offset in the data.
@param len the number of bytes to write.
@exception IOException if an I/O error occurs.]]>
Tests whether or not the specified pathname
should be included.]]>
<property> <name>fs.kfs.impl</name> <value>org.apache.hadoop.fs.kfs.KosmosFileSystem</value> <description>The FileSystem for kfs: uris.</description> </property>
<property> <name>fs.default.name</name> <value>kfs://<server:port></value> </property> <property> <name>fs.kfs.metaServerHost</name> <value><server></value> <description>The location of the KFS meta server.</description> </property> <property> <name>fs.kfs.metaServerPort</name> <value><port></value> <description>The location of the meta server's port.</description> </property>
export LD_LIBRARY_PATH=<path>
All files in the filesystem are migrated by re-writing the block metadata - no datafiles are touched.
]]>Files are stored in S3 as blocks (represented by {@link org.apache.hadoop.fs.s3.Block}), which have an ID and a length. Block metadata is stored in S3 as a small record (represented by {@link org.apache.hadoop.fs.s3.INode}) using the URL-encoded path string as a key. Inodes record the file type (regular file or directory) and the list of blocks. This design makes it easy to seek to any given position in a file by reading the inode data to compute which block to access, then using S3's support for HTTP Range headers to start streaming from the correct position. Renames are also efficient since only the inode is moved (by a DELETE followed by a PUT since S3 does not support renames).
For a single file /dir1/file1 which takes two blocks of storage, the file structure in S3 would be something like this:
/ /dir1 /dir1/file1 block-6415776850131549260 block-3026438247347758425
Inodes start with a leading /, while blocks are prefixed with block-.
If f is a file, this method will make a single call to S3.
If f is a directory, this method will make a maximum of
(n / 1000) + 2 calls to S3, where n is the total number of
files and directories contained directly in f.
]]>
Typical usage is something like the following:
DataInputBuffer buffer = new DataInputBuffer();
while (... loop condition ...) {
  byte[] data = ... get data ...;
  int dataLength = ... get data length ...;
  buffer.reset(data, dataLength);
  ... read buffer using DataInput methods ...
}]]>
Typical usage is something like the following:
DataOutputBuffer buffer = new DataOutputBuffer();
while (... loop condition ...) {
  buffer.reset();
  ... write buffer using DataOutput methods ...
  byte[] data = buffer.getData();
  int dataLength = buffer.getLength();
  ... write data to its ultimate destination ...
}]]>
Compared with ObjectWritable, this class is much more effective,
because ObjectWritable will append the class declaration as a String
into the output file in every Key-Value pair.
Generic Writable implements the {@link Configurable} interface, so that it
will be configured by the framework. The configuration is passed to the
wrapped objects implementing the {@link Configurable} interface before
deserialization.
How to use it: the subclass implements getTypes(), which defines
the classes which will be wrapped in GenericObject in the application.
Attention: the classes defined in the getTypes() method must
implement the Writable interface.
Example:
public class GenericObject extends GenericWritable {
  private static Class[] CLASSES = {
    ClassType1.class,
    ClassType2.class,
    ClassType3.class,
  };
  protected Class[] getTypes() {
    return CLASSES;
  }
}
@since Nov 8, 2006]]>
Typical usage is something like the following:
InputBuffer buffer = new InputBuffer();
while (... loop condition ...) {
  byte[] data = ... get data ...;
  int dataLength = ... get data length ...;
  buffer.reset(data, dataLength);
  ... read buffer using InputStream methods ...
}
@see DataInputBuffer
@see DataOutput]]>
A map is a directory containing two files: the data file,
containing all keys and values in the map, and a smaller index
file, containing a fraction of the keys. The fraction is determined by
{@link Writer#getIndexInterval()}.
The index file is read entirely into memory. Thus key implementations should try to keep themselves small.
Map files are created by adding entries in-order. To maintain a large database, perform updates by copying the previous version of a database and merging in a sorted change list, to create a new version of the database in a new file. Sorting large change lists can be done with {@link SequenceFile.Sorter}.]]>
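A minimal writer sketch in the constructor style of this era (the local path
"/tmp/example.map" and the key/value types are illustrative only); note that
entries must be appended in sorted key order:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    MapFile.Writer writer =
        new MapFile.Writer(conf, fs, "/tmp/example.map",
                           Text.class, IntWritable.class);
    writer.append(new Text("alpha"), new IntWritable(1)); // keys in sorted order
    writer.append(new Text("beta"), new IntWritable(2));
    writer.close(); // also writes the index file
  }
}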
Read the next key/value pair in the map into key and
val. Returns true if such a pair exists and false when at
the end of the map]]>
Return the record matching the specified key if it exists. Otherwise,
return the record that sorts just after.
@return - the key that was the closest match or null if eof.]]>
Typical usage is something like the following:
OutputBuffer buffer = new OutputBuffer();
while (... loop condition ...) {
  buffer.reset();
  ... write buffer using OutputStream methods ...
  byte[] data = buffer.getData();
  int dataLength = buffer.getLength();
  ... write data to its ultimate destination ...
}
@see DataOutputBuffer
@see InputBuffer]]>
SequenceFile provides {@link Writer}, {@link Reader} and
{@link Sorter} classes for writing, reading and sorting respectively.
There are three SequenceFile Writers based on the
{@link CompressionType} used to compress key/value pairs:
Writer: Uncompressed records.
RecordCompressWriter: Record-compressed files, only compress
values.
BlockCompressWriter: Block-compressed files, both keys &
values are collected in 'blocks' separately and compressed. The size of
the 'block' is configurable.
The actual compression algorithm used to compress key and/or values can be
specified by using the appropriate {@link CompressionCodec}.
The recommended way is to use the static createWriter methods
provided by the SequenceFile to choose the preferred format.
The {@link Reader} acts as the bridge and can read any of the above
SequenceFile formats.
Essentially there are 3 different formats for SequenceFiles
depending on the CompressionType specified. All of them share a
common header described below. Among other things, the header records the
CompressionCodec class which is used for
compression of keys and/or values (if compression is enabled).
In each of the three formats, a sync-marker is written every few
100 bytes or so.
The compressed blocks of key lengths and value lengths consist of the actual lengths of individual keys/values encoded in ZeroCompressedInteger format.
@see CompressionCodec]]>
Read the next key/value pair in the file into key and
val. Returns true if such a pair exists and false when at
end of file]]>
Returns the value for the named key, or null if no match exists.]]>
Finds the first occurrence of the search string in the UTF-8 buffer,
starting at position start. The starting
position is measured in bytes and the return value is in
terms of byte position in the buffer. The backing buffer is
not converted to a string for this operation.
@return byte position of the first occurrence of the search
string in the UTF-8 buffer or -1 if not found]]>
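To make the byte-position semantics concrete, a hedged illustration:
import org.apache.hadoop.io.Text;

public class TextFindExample {
  public static void main(String[] args) {
    Text t = new Text("h\u00e9llo"); // "héllo"
    // '\u00e9' occupies two bytes in UTF-8, so "llo" starts at byte 3,
    // not at character index 2.
    System.out.println(t.find("llo")); // prints 3
    System.out.println(t.find("xyz")); // prints -1: not found
  }
}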
Also includes utilities for serializing/deserializing a string,
coding/decoding a string, checking if a byte array contains valid UTF-8 code,
and calculating the length of an encoded string.]]>
Serialize the fields of this object to out.
@param out DataOutput to serialize this object into.
@throws IOException]]>
For efficiency, implementations should attempt to re-use storage in the existing object where possible.
@param in DataInput
to deserialize this object from.
@throws IOException]]>
key
or value
type in the Hadoop Map-Reduce
framework implements this interface.
Implementations typically implement a static read(DataInput)
method which constructs a new instance, calls {@link #readFields(DataInput)}
and returns the instance.
Example:
public class MyWritable implements Writable {
  // Some data
  private int counter;
  private long timestamp;

  public void write(DataOutput out) throws IOException {
    out.writeInt(counter);
    out.writeLong(timestamp);
  }

  public void readFields(DataInput in) throws IOException {
    counter = in.readInt();
    timestamp = in.readLong();
  }

  public static MyWritable read(DataInput in) throws IOException {
    MyWritable w = new MyWritable();
    w.readFields(in);
    return w;
  }
}]]>
WritableComparables can be compared to each other, typically
via Comparators. Any type which is to be used as a
key
in the Hadoop Map-Reduce framework should implement this
interface.
Example:
public class MyWritableComparable implements WritableComparable {
  // Some data
  private int counter;
  private long timestamp;

  public void write(DataOutput out) throws IOException {
    out.writeInt(counter);
    out.writeLong(timestamp);
  }

  public void readFields(DataInput in) throws IOException {
    counter = in.readInt();
    timestamp = in.readLong();
  }

  public int compareTo(MyWritableComparable w) {
    int thisValue = this.counter;
    int thatValue = w.counter;
    return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
  }
}]]>
One may optimize compare-intensive operations by overriding {@link #compare(byte[],int,int,byte[],int,int)}. Static utility methods are provided to assist in optimized implementations of this method.]]>
Get a Compressor for the given CompressionCodec.
@return a Compressor for the given
CompressionCodec, from the pool or a new one]]>
Get a Decompressor for the given CompressionCodec.
@return a Decompressor for the given
CompressionCodec, from the pool or a new one]]>
Returns true if a preset dictionary is needed for decompression.]]>
CBZip2InputStream reads bytes from the compressed source stream via the
single byte {@link java.io.InputStream#read() read()} method exclusively.
Thus you should consider using a buffered source stream.
Instances of this class are not threadsafe.
]]>Attention: The caller is responsible for writing the two BZip2 magic bytes
"BZ" to the specified stream prior to calling this constructor.
@param out the destination stream.
@throws IOException if an I/O error occurs in the specified stream.
@throws NullPointerException if out == null.]]>
Attention: The caller is responsible for writing the two BZip2 magic bytes
"BZ" to the specified stream prior to calling this constructor.
@param out the destination stream.
@param blockSize the blockSize as 100k units.
@throws IOException if an I/O error occurs in the specified stream.
@throws IllegalArgumentException if (blockSize < 1) || (blockSize > 9).
@throws NullPointerException if out == null.
@see #MIN_BLOCKSIZE
@see #MAX_BLOCKSIZE]]>
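A hedged sketch of the calling convention noted above (the output path is
hypothetical, and the package is assumed to be
org.apache.hadoop.io.compress.bzip2): the caller writes the two magic bytes
before constructing the stream.
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.hadoop.io.compress.bzip2.CBZip2OutputStream;

public class BzipExample {
  public static void main(String[] args) throws Exception {
    OutputStream raw = new FileOutputStream("/tmp/example.bz2");
    raw.write('B'); // the caller, not the stream,
    raw.write('Z'); // writes the BZip2 magic bytes
    CBZip2OutputStream out = new CBZip2OutputStream(raw, 9); // 9 = 900k blocks
    out.write("hello bzip2".getBytes("UTF-8"));
    out.close(); // finishes the bzip2 stream and closes raw
  }
}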
You can shrink the amount of allocated memory and maybe raise the compression speed by choosing a lower blocksize, which in turn may cause a lower compression ratio. You can avoid unnecessary memory allocation by avoiding using a blocksize which is bigger than the size of the input.
You can compute the memory usage for compressing by the following formula:
<code>400k + (9 * blocksize)</code>.
To get the memory required for decompression by {@link CBZip2InputStream CBZip2InputStream} use
<code>65k + (5 * blocksize)</code>.
Memory usage by blocksize:
Blocksize | Compression memory usage | Decompression memory usage
----------|--------------------------|---------------------------
100k | 1300k | 565k |
200k | 2200k | 1065k |
300k | 3100k | 1565k |
400k | 4000k | 2065k |
500k | 4900k | 2565k |
600k | 5800k | 3065k |
700k | 6700k | 3565k |
800k | 7600k | 4065k |
900k | 8500k | 4565k |
For decompression CBZip2InputStream allocates less memory if the bzipped input is smaller than one block.
Instances of this class are not threadsafe.
TODO: Update to BZip2 1.0.1
]]>false
]]>
sleepTime multiplied by the number of tries so far.
]]>
sleepTime multiplied by a random
number in the range of [0, 2 to the number of retries)
]]>
void
methods, or by
re-throwing the exception for non-void
methods.
]]>
@return true if the method should be retried,
false if the method should not be retried
but shouldn't fail with an exception (only for void methods).
@throws Exception The re-thrown exception e indicating
that the method failed and should not be retried further.]]>
Typical usage is
UnreliableImplementation unreliableImpl = new UnreliableImplementation();
UnreliableInterface unreliable = (UnreliableInterface)
  RetryProxy.create(UnreliableInterface.class, unreliableImpl,
    RetryPolicies.retryUpToMaximumCountWithFixedSleep(4, 10, TimeUnit.SECONDS));
unreliable.call();
This will retry any method called on unreliable four times -
in this case the call() method - sleeping 10 seconds between
each retry. There are a number of
{@link org.apache.hadoop.io.retry.RetryPolicies retry policies}
available, or you can implement a custom one by implementing {@link org.apache.hadoop.io.retry.RetryPolicy}.
It is also possible to specify retry policies on a
{@link org.apache.hadoop.io.retry.RetryProxy#create(Class, Object, Map) per-method basis}.
Deserialize the next object from the underlying input stream.
If the object t is non-null then this deserializer
may set its internal state to the next object read from the input
stream. Otherwise, if the object t is null a new
deserialized object will be created.
@return the deserialized object]]>
Deserializers are stateful, but must not buffer the input since other producers may read from the input between calls to {@link #deserialize(Object)}.
@param <T>
One may optimize compare-intensive operations by using a custom
implementation of {@link RawComparator} that operates directly on byte
representations.
@paramio.serializations
property from conf
, which is a comma-delimited list of
classnames.
]]>
Serialize t to the underlying output stream.]]>
Serializers are stateful, but must not buffer the output since other producers may write to the output between calls to {@link #serialize(Object)}.
@param <T>
To add a new serialization framework write an implementation of
{@link org.apache.hadoop.io.serializer.Serialization} and add its name to
the "io.serializations" property.
Make a call, passing param, to the IPC server running at
address, returning the value. Throws exceptions if there are
network problems or if the remote code threw an exception.
@deprecated Use {@link #call(Writable, InetSocketAddress, Class, UserGroupInformation)} instead]]>
Make a call, passing param, to the IPC server running at
address with the ticket credentials, returning
the value.
Throws exceptions if there are network problems or if the remote code
threw an exception.
@deprecated Use {@link #call(Writable, InetSocketAddress, Class, UserGroupInformation)} instead]]>
Make a call, passing param, to the IPC server running at
address which is servicing the protocol protocol,
with the ticket credentials, returning the value.
Throws exceptions if there are network problems or if the remote code
threw an exception.]]>
Unwraps and returns any wrapped Throwable that has a constructor taking
a String as a parameter.
Otherwise it returns this.
@return Throwable]]>
boolean, byte, char, short,
int, long, float,
double, or void; or
; orFor the metrics that are sampled and averaged, one must specify a metrics context that does periodic update calls. Most metrics contexts do. The default Null metrics context however does NOT. So if you aren't using any other metrics context then you can turn on the viewing and averaging of sampled metrics by specifying the following two lines in the hadoop-meterics.properties file:
rpc.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread rpc.period=10
Note that the metrics are collected regardless of the context used. The
context with the update thread is used to average the data periodically.
Implementation detail: we use a dynamic mbean that gets the list of the
metrics from the metrics registry passed as an argument to the constructor.]]>
{@link #rpcQueueTime}.inc(time)]]>
rpc.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread rpc.period=10
Note that the metrics are collected regardless of the context used. The context with the update thread is used to average the data periodically]]>
The default is org.apache.hadoop.metrics.spi.NullContext, which is a
dummy "no-op" context which will cause all metric data to be discarded.
@param contextName the name of the context
@return the named MetricsContext]]>
Attributes are set from the file hadoop-metrics.properties
if it exists on the class path. If it
exists, it must be in the format defined by java.util.Properties, and all
the properties in the file are set as attributes on the newly created
ContextFactory instance.
@return the singleton ContextFactory instance]]>
Throws an exception if the metrics implementation is configured with a fixed
set of record names and recordName is not in that set.
@param recordName the name of the record
@throws MetricsException if recordName conflicts with configuration data]]>
An application calls update() to pass the record to the
client library.
Metric data is not immediately sent to the metrics system
each time that update()
is called.
An internal table is maintained, identified by the record name. This
table has columns
corresponding to the tag and the metric names, and rows
corresponding to each unique set of tag values. An update
either modifies an existing row in the table, or adds a new row with a set of
tag values that are different from all the other rows. Note that if there
are no tags, then there can be at most one row in the table.
Once a row is added to the table, its data will be sent to the metrics system
on every timer period, whether or not it has been updated since the previous
timer period. If this is inappropriate, for example if metrics were being
reported by some transient object in an application, the remove()
method can be used to remove the row and thus stop the data from being
sent.
Note that the update()
method is atomic. This means that it is
safe for different threads to be updating the same metric. More precisely,
it is OK for different threads to call update()
on MetricsRecord instances
with the same set of tag names and tag values. Different threads should
not use the same MetricsRecord instance at the same time.]]>
org.apache.hadoop.metrics.spi
org.apache.hadoop.metrics.file
org.apache.hadoop.metrics.ganglia
private ContextFactory contextFactory = ContextFactory.getFactory();

void reportMyMetric(float myMetric) {
  MetricsContext myContext = contextFactory.getContext("myContext");
  MetricsRecord myRecord = myContext.getRecord("myRecord");
  myRecord.setMetric("myMetric", myMetric);
  myRecord.update();
}
In this example there are three names: the context name myContext,
the record name myRecord, and the metric name myMetric.
private MetricsRecord diskStats =
    contextFactory.getContext("myContext").getRecord("diskStats");

void reportDiskMetrics(String diskName, float diskBusy, float diskUsed) {
  diskStats.setTag("diskName", diskName);
  diskStats.setMetric("diskBusy", diskBusy);
  diskStats.setMetric("diskUsed", diskUsed);
  diskStats.update();
}
Data is not sent immediately to the metrics system when MetricsRecord.update()
is called. Instead it is stored in an
internal table, and the contents of the table are sent periodically.
This can be important for two reasons:
An application can also register an updater via the registerUpdater()
method. The benefit of this
versus using java.util.Timer
is that the callbacks will be done
immediately before sending the data, making the data as current as possible.
ContextFactory factory = ContextFactory.getFactory();
... examine and/or modify factory attributes ...
MetricsContext context = factory.getContext("myContext");
The factory attributes can be examined and modified using the following
ContextFactory
methods:
Object getAttribute(String attributeName)
String[] getAttributeNames()
void setAttribute(String name, Object value)
void removeAttribute(attributeName)
ContextFactory.getFactory()
initializes the factory attributes by
reading the properties file hadoop-metrics.properties
if it exists
on the class path.
A factory attribute named:
contextName.class
should have as its value the fully qualified name of the class to be
instantiated by a call of the ContextFactory method
getContext(contextName)
. If this factory attribute is not
specified, the default is to instantiate
org.apache.hadoop.metrics.file.FileContext
.
Other factory attributes are specific to a particular implementation of this
API and are documented elsewhere. For example, configuration attributes for
the file and Ganglia implementations can be found in the javadoc for
their respective packages.]]>
myContextName.fileName=/tmp/metrics.log
myContextName.period=5]]>
Throws an exception if the metrics implementation is configured with a fixed
set of record names and recordName is not in that set.
@param recordName the name of the record
@throws MetricsException if recordName conflicts with configuration data]]>
Calls the emitRecord method in order to transmit
the data.]]>
remove().]]>
A Ganglia implementation is in the package
org.apache.hadoop.metrics.ganglia.
Plugging in an implementation involves writing a concrete subclass of
AbstractMetricsContext. The subclass should get its
configuration information using the getAttribute(attributeName)
method.]]>
This is a drop-in replacement for Socket.connect(). If
socket.getChannel() returns a non-null channel,
connect is implemented using Hadoop's selectors. This is done mainly
to prevent Sun's connect implementation from creating thread-local
selectors, since Hadoop does not have control over when these are closed
and could end up taking all the available file descriptors.
@see java.net.Socket#connect(java.net.SocketAddress, int)
@param socket
@param endpoint
@param timeout - timeout in milliseconds]]>
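A short usage sketch (host and port are hypothetical; NetUtils is assumed to
live in org.apache.hadoop.net):
import java.net.InetSocketAddress;
import java.net.Socket;
import org.apache.hadoop.net.NetUtils;

public class ConnectExample {
  public static void main(String[] args) throws Exception {
    Socket socket = new Socket(); // no channel: falls back to Socket.connect
    InetSocketAddress endpoint =
        new InetSocketAddress("namenode.example.com", 8020);
    NetUtils.connect(socket, endpoint, 10000); // 10 second timeout
    socket.close();
  }
}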
recfile = *include module *record
include = "include" path
path = (relative-path / absolute-path)
module = "module" module-name
module-name = name *("." name)
record := "class" name "{" 1*(field) "}"
field := type name ";"
name := ALPHA (ALPHA / DIGIT / "_" )*
type := (ptype / ctype)
ptype := ("byte" / "boolean" / "int" |
"long" / "float" / "double"
"ustring" / "buffer")
ctype := (("vector" "<" type ">") /
("map" "<" type "," type ">" ) ) / name)
A DDL file describes one or more record types. It begins with zero or
more include declarations, a single mandatory module declaration
followed by zero or more class declarations. The semantics of each of
these declarations are described below:
module links {
class Link {
ustring URL;
boolean isRelative;
ustring anchorText;
};
}
include "links.jr"
module outlinks {
class OutLinks {
ustring baseURL;
vector<ustring> outLinks;
};
}
$ rcc -l C++ ...
namespace hadoop {
enum RecFormat { kBinary, kXML, kCSV };
class InStream {
public:
virtual ssize_t read(void *buf, size_t n) = 0;
};
class OutStream {
public:
virtual ssize_t write(const void *buf, size_t n) = 0;
};
class IOError : public runtime_error {
public:
explicit IOError(const std::string& msg);
};
class IArchive;
class OArchive;
class RecordReader {
public:
RecordReader(InStream& in, RecFormat fmt);
virtual ~RecordReader(void);
virtual void read(Record& rec);
};
class RecordWriter {
public:
RecordWriter(OutStream& out, RecFormat fmt);
virtual ~RecordWriter(void);
virtual void write(Record& rec);
};
class Record {
public:
virtual std::string type(void) const = 0;
virtual std::string signature(void) const = 0;
protected:
virtual bool validate(void) const = 0;
virtual void
serialize(OArchive& oa, const std::string& tag) const = 0;
virtual void
deserialize(IArchive& ia, const std::string& tag) = 0;
};
}
namespace links {
class Link : public hadoop::Record {
// ....
};
};
Each field within the record will cause the generation of a private member
declaration of the appropriate type in the class declaration, and one or more
accessor methods. The generated class will implement the serialize and
deserialize methods defined in hadoop::Record. It will also
implement the inspection methods type and signature from
hadoop::Record. A default constructor and virtual destructor will also
be generated. Serialization code will read/write records into streams that
implement the hadoop::InStream and the hadoop::OutStream interfaces.
For each member of a record an accessor method is generated that returns
either the member or a reference to the member. For members that are returned
by value, a setter method is also generated. This is true for primitive
data members of the types byte, int, long, boolean, float and
double. For example, for an int field called MyField the following
code is generated.
...
private:
int32_t mMyField;
...
public:
int32_t getMyField(void) const {
return mMyField;
};
void setMyField(int32_t m) {
mMyField = m;
};
...
For a ustring, buffer or composite field, the generated code
contains only accessors that return a reference to the field. A const
and a non-const accessor are generated. For example:
...
private:
std::string mMyBuf;
...
public:
std::string& getMyBuf() {
return mMyBuf;
};
const std::string& getMyBuf() const {
return mMyBuf;
};
...
Suppose the inclrec.jr file contains:
module inclrec {
class RI {
int I32;
double D;
ustring S;
};
}
and the testrec.jr file contains:
include "inclrec.jr"
module testrec {
class R {
vector<float> VF;
RI Rec;
buffer Buf;
};
}
Then the invocation of rcc such as:
$ rcc -l c++ inclrec.jr testrec.jr
will result in generation of four files:
inclrec.jr.{cc,hh} and testrec.jr.{cc,hh}.
The inclrec.jr.hh will contain:
#ifndef _INCLREC_JR_HH_
#define _INCLREC_JR_HH_
#include "recordio.hh"
namespace inclrec {
class RI : public hadoop::Record {
private:
int32_t I32;
double D;
std::string S;
public:
RI(void);
virtual ~RI(void);
virtual bool operator==(const RI& peer) const;
virtual bool operator<(const RI& peer) const;
virtual int32_t getI32(void) const { return I32; }
virtual void setI32(int32_t v) { I32 = v; }
virtual double getD(void) const { return D; }
virtual void setD(double v) { D = v; }
virtual std::string& getS(void) { return S; }
virtual const std::string& getS(void) const { return S; }
virtual std::string type(void) const;
virtual std::string signature(void) const;
protected:
virtual void serialize(hadoop::OArchive& a) const;
virtual void deserialize(hadoop::IArchive& a);
};
} // end namespace inclrec
#endif /* _INCLREC_JR_HH_ */
The testrec.jr.hh file will contain:
#ifndef _TESTREC_JR_HH_
#define _TESTREC_JR_HH_
#include "inclrec.jr.hh"
namespace testrec {
class R : public hadoop::Record {
private:
std::vector<float> VF;
inclrec::RI Rec;
std::string Buf;
public:
R(void);
virtual ~R(void);
virtual bool operator==(const R& peer) const;
virtual bool operator<(const R& peer) const;
virtual std::vector<float>& getVF(void);
virtual const std::vector<float>& getVF(void) const;
virtual std::string& getBuf(void);
virtual const std::string& getBuf(void) const;
virtual inclrec::RI& getRec(void);
virtual const inclrec::RI& getRec(void) const;
virtual void serialize(hadoop::OArchive& a) const;
virtual void deserialize(hadoop::IArchive& a);
virtual std::string type(void) const;
virtual std::string signature(void) const;
};
}; // end namespace testrec
#endif /* _TESTREC_JR_HH_ */
DDL Type C++ Type Java Type
boolean bool boolean
byte int8_t byte
int int32_t int
long int64_t long
float float float
double double double
ustring std::string java.lang.String
buffer std::string org.apache.hadoop.record.Buffer
class type class type class type
vector std::vector java.util.ArrayList
map std::map java.util.TreeMap
record = primitive / struct / vector / map
primitive = boolean / int / long / float / double / ustring / buffer
boolean = "T" / "F"
int = ["-"] 1*DIGIT
long = ";" ["-"] 1*DIGIT
float = ["-"] 1*DIGIT "." 1*DIGIT ["E" / "e" ["-"] 1*DIGIT]
double = ";" ["-"] 1*DIGIT "." 1*DIGIT ["E" / "e" ["-"] 1*DIGIT]
ustring = "'" *(UTF8 char except NULL, LF, % and , / "%00" / "%0a" / "%25" / "%2c" )
buffer = "#" *(BYTE except NULL, LF, % and , / "%00" / "%0a" / "%25" / "%2c" )
struct = "s{" record *("," record) "}"
vector = "v{" [record *("," record)] "}"
map = "m{" [*(record "," record)] "}"
class {
int MY_INT; // value 5
vector<float> MY_VEC; // values 0.1, -0.89, 2.45e4
buffer MY_BUF; // value '\00\n\tabc%'
}
is serialized as
<value>
<struct>
<member>
<name>MY_INT</name>
<value><i4>5</i4></value>
</member>
<member>
<name>MY_VEC</name>
<value>
<array>
<data>
<value><ex:float>0.1</ex:float></value>
<value><ex:float>-0.89</ex:float></value>
<value><ex:float>2.45e4</ex:float></value>
</data>
</array>
</value>
</member>
<member>
<name>MY_BUF</name>
<value><string>%00\n\tabc%25</string></value>
</member>
</struct>
</value>
]]>
The task requires the file or the nested fileset element to be
specified. Optional attributes are language (set the output
language, default is "java"),
destdir (name of the destination directory for generated java/c++
code, default is ".") and failonerror (specifies error handling
behavior, default is true).
<recordcc destdir="${basedir}/gensrc" language="java"> <fileset include="**\/*.jr" /> </recordcc>]]>
ugi
]]>
Store the given ugi in conf as the value of property attr.
The String starts with the user name followed by the default group names,
and other group names.
@param conf configuration
@param attr property name
@param ugi a UnixUserGroupInformation]]>
The UGI is stored under the property attr
as a comma separated string that starts
with the user name followed by group names.
If the property name is not defined, return null.
It's assumed that there is only one UGI per user. If this user already
has a UGI in the ugi map, return the ugi in the map.
Otherwise, construct a UGI from the configuration, store it in the
ugi map and return it.
@param conf configuration
@param attr property name
@return a UnixUGI
@throws LoginException if the stored string is ill-formatted.]]>
Create a GenericOptionsParser to parse only the generic Hadoop
arguments.
The array of string arguments other than the generic arguments can be
obtained by {@link #getRemainingArgs()}.
@param conf the Configuration
to modify.
@param args command-line arguments.]]>
The resulting CommandLine object can be obtained by
{@link #getCommandLine()}.
@param conf the configuration to modify
@param options options built by the caller
@param args User-specified arguments]]>
Returns the commons-cli CommandLine object
representing the list of arguments
parsed against the Options descriptor.]]>
GenericOptionsParser recognizes several standard command
line arguments, enabling applications to easily specify a namenode, a
jobtracker, additional configuration resources etc.
The supported generic options are:
-conf <configuration file>    specify a configuration file
-D <property=value>           use value for given property
-fs <local|namenode:port>     specify a namenode
-jt <local|jobtracker:port>   specify a job tracker
-files <comma separated list of files>
    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>
    specify comma separated jar files to include in the classpath
-archives <comma separated list of archives>
    specify comma separated archives to be unarchived on the compute machines
The general command line syntax is:
bin/hadoop command [genericOptions] [commandOptions]
Generic command line arguments might modify
Configuration
objects, given to constructors.
The functionality is implemented using Commons CLI.
Examples:
$ bin/hadoop dfs -fs darwin:8020 -ls /data
    list /data directory in dfs with namenode darwin:8020
$ bin/hadoop dfs -D fs.default.name=darwin:8020 -ls /data
    list /data directory in dfs with namenode darwin:8020
$ bin/hadoop dfs -conf hadoop-site.xml -ls /data
    list /data directory in dfs with conf specified in hadoop-site.xml
$ bin/hadoop job -D mapred.job.tracker=darwin:50020 -submit job.xml
    submit a job to job tracker darwin:50020
$ bin/hadoop job -jt darwin:50020 -submit job.xml
    submit a job to job tracker darwin:50020
$ bin/hadoop job -jt local -submit job.xml
    submit a job to local runner
$ bin/hadoop jar -libjars testlib.jar -archives test.tgz -files file.txt inputjar args
    job submission with libjars, files and archives
@see Tool
@see ToolRunner]]>
Returns the Class object (of type Class<T>) of the argument of type T.
@param <T>]]>
Converts the given List<T> to an array of T[].
@param c the Class object of the items in the list
@param list the list to convert]]>
Converts the given List<T> to an array of T[].
@param list the list to convert
@throws ArrayIndexOutOfBoundsException if the list is empty.
Use {@link #toArray(Class, List)} if the list may be empty.]]>
Configuration.
@param in input stream
@param conf configuration
@throws IOException]]>
false
]]>
false
otherwise.]]>
{ o = pq.pop(); o.change(); pq.push(o); }]]>
Clients and/or applications can use the provided Progressable
to explicitly report progress to the Hadoop framework. This is especially
important for operations which take a significant amount of time since,
in lieu of the reported progress, the framework has to assume that an error
has occurred and time-out the operation.]]>
Returns the Class of the given object.]]>
Get the ulimit command arguments, or null.
@param conf configuration
@return a String[] with the ulimit command arguments, or
null if we are running on a non *nix platform or
if the limit is unspecified.]]>
A base class for running a Unix command like du or
df. It also offers facilities to gate commands by
time-intervals.]]>
Escape charToEscape in the string with the escape char
escapeChar
@param str string
@param escapeChar escape char
@param charToEscape the char to be escaped
@return an escaped string]]>
Unescape charToEscape in the string with the escape char
escapeChar
@param str string
@param escapeChar escape char
@param charToEscape the escaped char
@return an unescaped string]]>
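A small round-trip illustration of this escape/unescape pair (the input
string is arbitrary):
import org.apache.hadoop.util.StringUtils;

public class EscapeExample {
  public static void main(String[] args) {
    String raw = "a,b\\c";
    String escaped = StringUtils.escapeString(raw, '\\', ',');    // "a\,b\\c"
    String back = StringUtils.unEscapeString(escaped, '\\', ','); // "a,b\c"
    System.out.println(raw.equals(back)); // prints true
  }
}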
Tool is the standard for any Map-Reduce tool/application.
The tool/application should delegate the handling of
standard command-line options to {@link ToolRunner#run(Tool, String[])}
and only handle its custom arguments.
Here is how a typical Tool is implemented:
public class MyApp extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    // Configuration processed by ToolRunner
    Configuration conf = getConf();

    // Create a JobConf using the processed conf
    JobConf job = new JobConf(conf, MyApp.class);

    // Process custom command-line options
    Path in = new Path(args[1]);
    Path out = new Path(args[2]);

    // Specify various job-specific parameters
    job.setJobName("my-app");
    job.setInputPath(in);
    job.setOutputPath(out);
    job.setMapperClass(MyApp.MyMapper.class);
    job.setReducerClass(MyApp.MyReducer.class);

    // Submit the job, then poll for progress until the job is complete
    JobClient.runJob(job);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // Let ToolRunner handle generic command-line options
    int res = ToolRunner.run(new Configuration(), new MyApp(), args);
    System.exit(res);
  }
}
@see GenericOptionsParser
@see ToolRunner]]>
Runs the given Tool by {@link Tool#run(String[])}, after parsing the
generic arguments. Uses the given Configuration, or builds one if null.
Sets the Tool's configuration with the possibly modified
version of the conf.
@param conf Configuration for the Tool.
@param tool Tool to run.
@param args command-line arguments to the tool.
@return exit code of the {@link Tool#run(String[])} method.]]>
Runs the Tool with its Configuration.
Equivalent to run(tool.getConf(), tool, args).
@param tool Tool to run.
@param args command-line arguments to the tool.
@return exit code of the {@link Tool#run(String[])} method.]]>
ToolRunner can be used to run classes implementing the
Tool interface. It works in conjunction with
{@link GenericOptionsParser} to parse the
generic hadoop command line arguments and modifies the
Configuration of the Tool. The
application-specific options are passed along without being modified.
@see Tool
@see GenericOptionsParser]]>
The Bloom filter is a data structure that was introduced in 1970 and that has been adopted by the networking research community in the past decade thanks to the bandwidth efficiencies that it offers for the transmission of set membership information between networked hosts. A sender encodes the information into a bit vector, the Bloom filter, that is more compact than a conventional representation. Computation and space costs for construction are linear in the number of elements. The receiver uses the filter to test whether various elements are members of the set. Though the filter will occasionally return a false positive, it will never return a false negative. When creating the filter, the sender can choose its desired point in a trade-off between the false positive rate and the size.
Originally created by European Commission One-Lab Project 034819. @see Filter The general behavior of a filter @see Space/Time Trade-Offs in Hash Coding with Allowable Errors]]>
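A minimal usage sketch; the package path (org.apache.hadoop.util.bloom) and
constructor arguments follow the later 0.20-era API and may differ in other
versions:
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class BloomExample {
  public static void main(String[] args) throws Exception {
    // 1024-bit vector, 4 hash functions, Murmur hashing.
    BloomFilter filter = new BloomFilter(1024, 4, Hash.MURMUR_HASH);
    filter.add(new Key("apple".getBytes("UTF-8")));
    // Never a false negative; occasionally a false positive.
    System.out.println(filter.membershipTest(new Key("apple".getBytes("UTF-8")))); // true
    System.out.println(filter.membershipTest(new Key("pear".getBytes("UTF-8"))));  // false, barring a false positive
  }
}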
Invariant: nothing happens if the specified key does not belong to this counter Bloom filter. @param key The key to remove.]]>
NOTE: due to the bucket size of this filter, inserting the same
key more than 15 times will cause an overflow at all filter positions
associated with this key, and it will significantly increase the error
rate for this and other keys. For this reason the filter can only be
used to store small count values 0 <= N << 15.
@param key key to be tested
@return 0 if the key is not present. Otherwise, a positive value v will
be returned such that v == count with probability equal to the
error rate of this filter, and v > count otherwise.
Additionally, if the filter experienced an underflow as a result of a
{@link #delete(Key)} operation, the return value may be lower than the
count with the probability of the false negative rate of such
filter.]]>
A counting Bloom filter is an improvement to a standard Bloom filter as it allows dynamic additions and deletions of set membership information. This is achieved through the use of a counting vector instead of a bit vector.
Originally created by European Commission One-Lab Project 034819. @see Filter The general behavior of a filter @see Summary cache: a scalable wide-area web cache sharing protocol]]>
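A hedged sketch of the dynamic deletion this variant adds (same package-path
assumption as the Bloom filter example above):
import org.apache.hadoop.util.bloom.CountingBloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class CountingBloomExample {
  public static void main(String[] args) throws Exception {
    CountingBloomFilter cbf = new CountingBloomFilter(1024, 4, Hash.MURMUR_HASH);
    Key k = new Key("user-42".getBytes("UTF-8"));
    cbf.add(k);
    cbf.delete(k); // dynamic deletion: a plain Bloom filter cannot do this
    System.out.println(cbf.membershipTest(k)); // false again after the delete
  }
}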
A dynamic Bloom filter (DBF) makes use of a s * m bit matrix but
each of the s rows is a standard Bloom filter. The creation
process of a DBF is iterative. At the start, the DBF is a 1 * m
bit matrix, i.e., it is composed of a single standard Bloom filter.
It assumes that nr elements are recorded in the
initial bit vector, where nr <= n (n is
the cardinality of the set A to record in the filter).
As the size of A grows during the execution of the application,
several keys must be inserted in the DBF. When inserting a key into the DBF,
one must first get an active Bloom filter in the matrix. A Bloom filter is
active when the number of recorded keys, nr, is
strictly less than the current cardinality of A, n.
If an active Bloom filter is found, the key is inserted and
nr is incremented by one. On the other hand, if there
is no active Bloom filter, a new one is created (i.e., a new row is added to
the matrix) according to the current size of A and the element
is added in this new Bloom filter and the nr value of
this new Bloom filter is set to one. A given key is said to belong to the
DBF if the k positions are set to one in one of the matrix rows.
Originally created by European Commission One-Lab Project 034819.
@see Filter The general behavior of a filter
@see BloomFilter A Bloom filter
@see Theory and Network Applications of Dynamic Bloom Filters]]>
Invariant: The result is assigned to this filter. @param filter The filter to AND with.]]>
Invariant: The result is assigned to this filter. @param filter The filter to OR with.]]>
Invariant: The result is assigned to this filter. @param filter The filter to XOR with.]]>
The result is assigned to this filter.]]>
Defines the general behavior of a filter representing a set A. The
key idea is to map entries of A (also called keys) into several positions
in a vector through the use of several hash functions.
Typically, a filter will be implemented as a Bloom filter (or a Bloom filter
extension).
It must be extended in order to define the real behavior.
@see Key The general behavior of a key
@see HashFunction A hash function]]>
Invariant: if the false positive is null, nothing happens.
@param key The false positive key to add.]]>
It allows the removal of selected false positives at the cost of introducing random false negatives, and with the benefit of eliminating some random false positives at the same time.
Originally created by European Commission One-Lab Project 034819. @see Filter The general behavior of a filter @see BloomFilter A Bloom filter @see RemoveScheme The different selective clearing algorithms @see Retouched Bloom Filters: Allowing Networked Applications to Trade Off Selected False Positives Against False Negatives]]>
If you need less than 32 bits, use a bitmask. For example, if you need
only 10 bits, do h = (h & hashmask(10));
in which case, the hash table should have hashsize(10) elements.
If you are hashing n strings byte[][] k, do it like this:
for (int i = 0, h = 0; i < n; ++i) h = hash(k[i], h);
By Bob Jenkins, 2006. bob_jenkins@burtleburtle.net. You may use this code any way you wish, private, educational, or commercial. It's free.
Use for hash table lookup, or anything where one collision in 2^^32 is acceptable. Do NOT use for cryptographic purposes.]]>