<ahref="http://maven.apache.org/"title="Built by Maven"class="poweredBy">
<imgalt="Built by Maven"src="../images/logos/maven-feather.png"/>
</a>
</div>
</div>
<divid="bodyColumn">
<divid="contentBox">
<!---
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. See accompanying LICENSE file.
-->
<h1>Statistic collection with the IOStatistics API</h1>
<divclass="source">
<divclass="source">
<pre>@InterfaceAudience.Public
@InterfaceStability.Unstable
</pre></div></div>
<p>The <code>IOStatistics</code> API is intended to provide statistics on individual IO classes -such as input and output streams, <i>in a standard way which applications can query</i></p>
<p>Many filesystem-related classes have implemented statistics gathering and provided private/unstable ways to query this, but as they were not common across implementations it was unsafe for applications to reference these values. Example: <code>S3AInputStream</code> and its statistics API. This is used in internal tests, but cannot be used downstream in applications such as Apache Hive or Apache HBase.</p>
<p>The IOStatistics API is intended to</p>
<olstyle="list-style-type: decimal">
<li>Be instance specific:, rather than shared across multiple instances of a class, or thread local.</li>
<li>Be public and stable enough to be used by applications.</li>
<li>Be easy to use in applications written in Java, Scala, and, via libhdfs, C/C++</li>
<li>Have foundational interfaces and classes in the <code>hadoop-common</code> JAR.</li>
</ol><section>
<h2><aname="Core_Model"></a>Core Model</h2>
<p>Any class <i>may</i> implement <code>IOStatisticsSource</code> in order to provide statistics.</p>
<p>Wrapper I/O Classes such as <code>FSDataInputStream</code> anc <code>FSDataOutputStream</code><i>should</i> implement the interface and forward it to the wrapped class, if they also implement it -and return <code>null</code> if they do not.</p>
<p><code>IOStatisticsSource</code> implementations <code>getIOStatistics()</code> return an instance of <code>IOStatistics</code> enumerating the statistics of that specific instance.</p>
<p>The <code>IOStatistics</code> Interface exports five kinds of statistic:</p>
<tableborder="0"class="bodyTable">
<thead>
<trclass="a">
<th> Category </th>
<th> Type </th>
<th> Description </th></tr>
</thead><tbody>
<trclass="b">
<td><code>counter</code></td>
<td><code>long</code></td>
<td> a counter which may increase in value; SHOULD BE >= 0 </td></tr>
<trclass="a">
<td><code>gauge</code></td>
<td><code>long</code></td>
<td> an arbitrary value which can down as well as up; SHOULD BE >= 0 </td></tr>
<trclass="b">
<td><code>minimum</code></td>
<td><code>long</code></td>
<td> an minimum value; MAY BE negative </td></tr>
<trclass="a">
<td><code>maximum</code></td>
<td><code>long</code></td>
<td> a maximum value; MAY BE negative </td></tr>
<trclass="b">
<td><code>meanStatistic</code></td>
<td><code>MeanStatistic</code></td>
<td> an arithmetic mean and sample size; mean MAY BE negative </td></tr>
</tbody>
</table>
<p>Four are simple <code>long</code> values, with the variations how they are likely to change and how they are aggregated.</p><section><section>
<h4><aname="Aggregation_of_Statistic_Values"></a>Aggregation of Statistic Values</h4>
<p>For the different statistic category, the result of <code>aggregate(x, y)</code> is</p>
<tableborder="0"class="bodyTable">
<thead>
<trclass="a">
<th> Category </th>
<th> Aggregation </th></tr>
</thead><tbody>
<trclass="b">
<td><code>counter</code></td>
<td><code>max(0, x) + max(0, y)</code></td></tr>
<trclass="a">
<td><code>gauge</code></td>
<td><code>max(0, x) + max(0, y)</code></td></tr>
<trclass="b">
<td><code>minimum</code></td>
<td><code>min(x, y)</code></td></tr>
<trclass="a">
<td><code>maximum</code></td>
<td><code>max(x, y)</code></td></tr>
<trclass="b">
<td><code>meanStatistic</code></td>
<td> calculation of the mean of <code>x</code> and <code>y</code> ) </td></tr>
<p>This package contains the public statistics APIs intended for use by applications.</p><!-- ============================================================= -->
* These statistics MUST be instance specific, not thread local.
*/
@InterfaceStability.Unstable
public interface IOStatisticsSource {
/**
* Return a statistics instance.
* It is not a requirement that the same instance is returned every time.
* {@link IOStatisticsSource}.
* If the object implementing this is Closeable, this method
* may return null if invoked on a closed object, even if
* it returns a valid instance when called earlier.
* @return an IOStatistics instance or null
*/
IOStatistics getIOStatistics();
}
</pre></div></div>
<p>This is the interface which an object instance MUST implement if they are a source of IOStatistics information.</p><section>
<h4><aname="Invariants"></a>Invariants</h4>
<p>The result of <code>getIOStatistics()</code> must be one of</p>
<ul>
<li><code>null</code></li>
<li>an immutable <code>IOStatistics</code> for which each map of entries is an empty map.</li>
<li>an instance of an <code>IOStatistics</code> whose statistics MUST BE unique to that instance of the class implementing <code>IOStatisticsSource</code>.</li>
</ul>
<p>Less formally: if the statistics maps returned are non-empty, all the statistics must be collected from the current instance, and not from any other instances, the way some of the <code>FileSystem</code> statistics are collected.</p>
<p>The result of <code>getIOStatistics()</code>, if non-null, MAY be a different instance on every invocation.</p><!-- ============================================================= -->
<p>The naming policy of statistics is designed to be readable, shareable and ideally consistent across <code>IOStatisticSource</code> implementations.</p>
<ul>
<li>
<p>Characters in key names MUST match the regular expression <code>[a-z|0-9|_]</code> with the exception of the first character, which MUST be in the range <code>[a-z]</code>. Thus the full regular expression for a valid statistic name is:</p>
<divclass="source">
<divclass="source">
<pre>[a-z][a-z|0-9|_]+
</pre></div></div>
</li>
<li>
<p>Where possible, the names of statistics SHOULD be those defined with common names.</p>
<p>Note 1.: these are evolving; for clients to safely reference their statistics by name they SHOULD be copied to the application. (i.e. for an application compiled hadoop 3.4.2 to link against hadoop 3.4.1, copy the strings).</p>
<p>Note 2: keys defined in these classes SHALL NOT be removed from subsequent Hadoop releases.</p>
<ul>
<li>
<p>A common statistic name MUST NOT be used to report any other statistic and MUST use the pre-defined unit of measurement.</p>
</li>
<li>
<p>A statistic name in one of the maps SHOULD NOT be re-used in another map. This aids diagnostics of logged statistics.</p>
<p>The operations to add/remove entries are unsupported: the map returned MAY be mutable by the source of statistics.</p>
</li>
<li>
<p>The map MAY be empty.</p>
</li>
<li>
<p>The map keys each represent a measured statistic.</p>
</li>
<li>
<p>The set of keys in a map SHOULD remain unchanged, and MUST NOT remove keys.</p>
</li>
<li>
<p>The statistics SHOULD be dynamic: every lookup of an entry SHOULD return the latest value.</p>
</li>
<li>
<p>The values MAY change across invocations of <code>Map.values()</code> and <code>Map.entries()</code></p>
</li>
<li>
<p>The update MAY be in the <code>iterable()</code> calls of the iterators returned, or MAY be in the actual <code>iterable.next()</code> operation. That is: there is no guarantee as to when the evaluation takes place.</p>
</li>
<li>
<p>The returned <code>Map.Entry</code> instances MUST return the same value on repeated <code>getValue()</code> calls. (i.e once you have the entry, it is immutable).</p>
</li>
<li>
<p>Queries of statistics SHOULD be fast and non-blocking to the extent that if invoked during a long operation, they will prioritize returning fast over most timely values.</p>
</li>
<li>
<p>The statistics MAY lag; especially for statistics collected in separate operations (e.g stream IO statistics as provided by a filesystem instance).</p>
</li>
<li>
<p>Statistics which represent time SHOULD use milliseconds as their unit.</p>
</li>
<li>
<p>Statistics which represent time and use a different unit MUST document the unit used.</p>
</li>
</ul></section><section>
<h3><aname="Thread_Model"></a>Thread Model</h3>
<olstyle="list-style-type: decimal">
<li>
<p>An instance of <code>IOStatistics</code> can be shared across threads;</p>
</li>
<li>
<p>Read access to the supplied statistics maps MUST be thread safe.</p>
</li>
<li>
<p>Iterators returned from the maps MUST NOT be shared across threads.</p>
</li>
<li>
<p>The statistics collected MUST include all operations which took place across all threads performing work for the monitored object.</p>
</li>
<li>
<p>The statistics reported MUST NOT be local to the active thread.</p>
</li>
</ol>
<p>This is different from the <code>FileSystem.Statistics</code> behavior where per-thread statistics are collected and reported.</p>
<p>That mechanism supports collecting limited read/write statistics for different worker threads sharing the same FS instance, but as the collection is thread local, it invariably under-reports IO performed in other threads on behalf of a worker thread.</p></section></section><section>
<p>Support for efficiently logging <code>IOStatistics</code>/<code>IOStatisticsSource</code> instances.</p>
<p>These are intended for assisting logging, including only enumerating the state of an <code>IOStatistics</code> instance when the log level needs it.</p>
<divclass="source">
<divclass="source">
<pre>LOG.info("IOStatistics after upload: {}", demandStringify(iostats));
// or even better, as it results in only a single object creations
<p>This contains implementation classes to support providing statistics to applications.</p>
<p>These MUST NOT BE used by applications. If a feature is needed from this package then the provisioning of a public implementation MAY BE raised via the Hadoop development channels.</p>
<p>These MAY be used by those implementations of the Hadoop <code>FileSystem</code>, <code>AbstractFileSystem</code> and related classes which are not in the hadoop source tree. Implementors MUST BE aware that the implementation this code is unstable and may change across minor point releases of Hadoop.</p></section>