PEP: 3116
Title: New I/O
Version: $Revision$
Last-Modified: $Date$
Author: Daniel Stutzbach, Mike Verdone, Guido van Rossum
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 26-Feb-2007
Python-Version: 3.0
Post-History: 26-Feb-2007

Rationale and Goals
===================

Python allows for a variety of stream-like (a.k.a. file-like) objects
that can be used via ``read()`` and ``write()`` calls. Anything that
provides ``read()`` and ``write()`` is stream-like. However, more
exotic and extremely useful functions like ``readline()`` or
``seek()`` may or may not be available on every stream-like object.
Python needs a specification for basic byte-based I/O streams to which
we can add buffering and text-handling features.

Once we have a defined raw byte-based I/O interface, we can add
buffering and text handling layers on top of any byte-based I/O class.
The same buffering and text handling logic can be used for files,
sockets, byte arrays, or custom I/O classes developed by Python
programmers. Developing a standard definition of a stream lets us
separate stream-based operations like ``read()`` and ``write()`` from
implementation specific operations like ``fileno()`` and ``isatty()``.
It encourages programmers to write code that uses streams as streams
and does not require that all streams support file-specific or
socket-specific operations.

The new I/O spec is intended to be similar to the Java I/O libraries,
but generally less confusing. Programmers who don't want to muck
about in the new I/O world can expect that the ``open()`` factory
method will produce an object backwards-compatible with old-style file
objects.

Specification
=============

The Python I/O Library will consist of three layers: a raw I/O layer,
a buffered I/O layer, and a text I/O layer. Each layer is defined by
an abstract base class, which may have multiple implementations.
The raw I/O and buffered I/O layers deal with units of bytes, while
the text I/O layer deals with units of characters.

Raw I/O
=======

The abstract base class for raw I/O is ``RawIOBase``. It has several
methods which are wrappers around the appropriate operating system
calls. If one of these functions would not make sense on the object,
the implementation must raise an ``IOError`` exception. For example,
if a file is opened read-only, the ``.write()`` method will raise an
``IOError``. As another example, if the object represents a socket,
then ``.seek()``, ``.tell()``, and ``.truncate()`` will raise an
``IOError``. Generally, a call to one of these functions maps to
exactly one operating system call.

    ``.read(n: int) -> bytes``

       Read up to ``n`` bytes from the object and return them. Fewer
       than ``n`` bytes may be returned if the operating system call
       returns fewer than ``n`` bytes. If 0 bytes are returned, this
       indicates end of file. If the object is in non-blocking mode
       and no bytes are available, the call returns ``None``.

    ``.readinto(b: bytes) -> int``

       Read up to ``len(b)`` bytes from the object and store them in
       ``b``, returning the number of bytes read. Like ``.read()``,
       fewer than ``len(b)`` bytes may be read, and 0 indicates end of
       file. ``None`` is returned if a non-blocking object has no
       bytes available. The length of ``b`` is never changed.

    ``.write(b: bytes) -> int``

       Returns the number of bytes written, which may be ``< len(b)``.

    ``.seek(pos: int, whence: int = 0) -> int``

    ``.tell() -> int``

    ``.truncate(n: int = None) -> int``

    ``.close() -> None``

Additionally, it defines a few other methods:

    ``.readable() -> bool``

       Returns ``True`` if the object was opened for reading,
       ``False`` otherwise. If ``False``, ``.read()`` will raise an
       ``IOError`` if called.

    ``.writable() -> bool``

       Returns ``True`` if the object was opened for writing,
       ``False`` otherwise. If ``False``, ``.write()`` and
       ``.truncate()`` will raise an ``IOError`` if called.


    ``.seekable() -> bool``

       Returns ``True`` if the object supports random access (such as
       disk files), or ``False`` if the object only supports
       sequential access (such as sockets, pipes, and ttys). If
       ``False``, ``.seek()``, ``.tell()``, and ``.truncate()`` will
       raise an ``IOError`` if called.

    ``.__enter__() -> ContextManager``

       Context management protocol. Returns ``self``.

    ``.__exit__(...) -> None``

       Context management protocol. Same as ``.close()``.

If and only if a ``RawIOBase`` implementation operates on an
underlying file descriptor, it must additionally provide a
``.fileno()`` member function. This could be defined specifically by
the implementation, or a mix-in class could be used (need to decide
about this).

    ``.fileno() -> int``

       Returns the underlying file descriptor (an integer).

Initially, three implementations will be provided that implement the
``RawIOBase`` interface: ``FileIO``, ``SocketIO``, and ``BytesIO``
(also ``MMapIO``?). Each implementation must determine whether the
object supports random access as the information provided by the user
may not be sufficient (consider ``open("/dev/tty", "rw")`` or
``open("/tmp/named-pipe", "rw")``). As an example, ``FileIO`` can
determine this by calling the ``seek()`` system call; if it returns an
error, the object does not support random access. Each implementation
may provide additional methods appropriate to its type. The
``BytesIO`` object is analogous to Python 2's ``cStringIO`` library,
but operating on the new bytes type instead of strings.

Buffered I/O
============

The next layer is the Buffered I/O layer which provides more efficient
access to file-like objects. The abstract base class for all Buffered
I/O implementations is ``BufferedIOBase``, which provides similar
methods to ``RawIOBase``:

    ``.read(n: int = -1) -> bytes``

       Returns the next ``n`` bytes from the object. It may return
       fewer than ``n`` bytes if end-of-file is reached or the object
       is non-blocking. 0 bytes indicates end-of-file.

       This method may make multiple calls to ``RawIOBase.read()`` to
       gather the bytes, or may make no calls to ``RawIOBase.read()``
       if all of the needed bytes are already buffered.

    ``.readinto(b: bytes) -> int``

    ``.write(b: bytes) -> int``

       Write ``b`` bytes to the buffer. The bytes are not guaranteed
       to be written to the Raw I/O object immediately; they may be
       buffered. Returns ``len(b)``.

    ``.seek(pos: int, whence: int = 0) -> int``

    ``.tell() -> int``

    ``.truncate(pos: int = None) -> int``

    ``.flush() -> None``

    ``.close() -> None``

    ``.readable() -> bool``

    ``.writable() -> bool``

    ``.seekable() -> bool``

    ``.__enter__() -> ContextManager``

    ``.__exit__(...) -> None``

Additionally, the abstract base class provides one member variable:

    ``.raw``

       A reference to the underlying ``RawIOBase`` object.

The ``BufferedIOBase`` method signatures are mostly identical to those
of ``RawIOBase`` (exceptions: ``write()`` returns ``None``,
``read()``'s argument is optional), but may have different semantics.
In particular, ``BufferedIOBase`` implementations may read more data
than requested or delay writing data using buffers. For the most
part, this will be transparent to the user (unless, for example, they
open the same file through a different descriptor). Also, raw reads
may return a short read without any particular reason; buffered reads
will only return a short read if EOF is reached; and raw writes may
return a short count (even when non-blocking I/O is not enabled!),
while buffered writes will raise ``IOError`` when not all bytes could
be written or buffered.

There are four implementations of the ``BufferedIOBase`` abstract base
class, described below.

``BufferedReader``
------------------

The ``BufferedReader`` implementation is for sequential-access
read-only objects. Its ``.flush()`` method is a no-op.

``BufferedWriter``
------------------

The ``BufferedWriter`` implementation is for sequential-access
write-only objects.
Its ``.flush()`` method forces all cached data to be written to the
underlying ``RawIOBase`` object.

``BufferedRWPair``
------------------

The ``BufferedRWPair`` implementation is for sequential-access
read-write objects such as sockets and ttys. As the read and write
streams of these objects are completely independent, it could be
implemented by simply incorporating a ``BufferedReader`` and
``BufferedWriter`` instance. It provides a ``.flush()`` method that
has the same semantics as a ``BufferedWriter``'s ``.flush()`` method.

``BufferedRandom``
------------------

The ``BufferedRandom`` implementation is for all random-access
objects, whether they are read-only, write-only, or read-write.
Compared to the previous classes that operate on sequential-access
objects, the ``BufferedRandom`` class must contend with the user
calling ``.seek()`` to reposition the stream. Therefore, an instance
of ``BufferedRandom`` must keep track of both the logical and true
position within the object. It provides a ``.flush()`` method that
forces all cached write data to be written to the underlying
``RawIOBase`` object and all cached read data to be forgotten (so that
future reads are forced to go back to the disk).

*Q: Do we want to mandate in the specification that switching between
reading and writing on a read-write object implies a .flush()? Or is
that an implementation convenience that users should not rely on?*

For a read-only ``BufferedRandom`` object, ``.writable()`` returns
``False`` and the ``.write()`` and ``.truncate()`` methods throw
``IOError``.

For a write-only ``BufferedRandom`` object, ``.readable()`` returns
``False`` and the ``.read()`` method throws ``IOError``.

Text I/O
========

The text I/O layer provides functions to read and write strings from
streams. Some new features include universal newlines and character
set encoding and decoding. The Text I/O layer is defined by a
``TextIOBase`` abstract base class.
It provides several methods that are similar to the ``BufferedIOBase``
methods, but operate on a per-character basis instead of a per-byte
basis. These methods are:

    ``.read(n: int = -1) -> str``

    ``.write(s: str) -> int``

    ``.tell() -> object``

       Return a cookie describing the current file position. The only
       supported use for the cookie is with ``.seek()`` with
       ``whence`` set to 0 (i.e. absolute seek).

    ``.seek(pos: object, whence: int = 0) -> int``

       Seek to position ``pos``. If ``pos`` is non-zero, it must be a
       cookie returned from ``.tell()`` and ``whence`` must be zero.

    ``.truncate(pos: object = None) -> int``

       Like ``BufferedIOBase.truncate()``, except that ``pos`` (if not
       ``None``) must be a cookie previously returned by ``.tell()``.

Unlike with raw I/O, the units for ``.seek()`` are not specified -
some implementations (e.g. ``StringIO``) use characters and others
(e.g. ``TextIOWrapper``) use bytes. The special case for zero is to
allow going to the start or end of a stream without a prior
``.tell()``. An implementation could include stream encoder state in
the cookie returned from ``.tell()``.

``TextIOBase`` implementations also provide several methods that are
pass-throughs to the underlying ``BufferedIOBase`` objects:

    ``.flush() -> None``

    ``.close() -> None``

    ``.readable() -> bool``

    ``.writable() -> bool``

    ``.seekable() -> bool``

``TextIOBase`` class implementations additionally provide the
following methods:

    ``.readline() -> str``

       Read until newline or EOF and return the line, or ``""`` if
       EOF hit immediately.

    ``.__iter__() -> Iterator``

       Returns an iterator that returns lines from the file (which
       happens to be ``self``).

    ``.next() -> str``

       Same as ``readline()`` except raises ``StopIteration`` if EOF
       hit immediately.

Two implementations will be provided by the Python library. The
primary implementation, ``TextIOWrapper``, wraps a Buffered I/O
object.
Each ``TextIOWrapper`` object has a property named "``.buffer``" that
provides a reference to the underlying ``BufferedIOBase`` object. Its
initializer has the following signature:

    ``.__init__(self, buffer, encoding=None, newline=None)``

       ``buffer`` is a reference to the ``BufferedIOBase`` object to
       be wrapped with the ``TextIOWrapper``.

       ``encoding`` refers to an encoding to be used for translating
       between the byte-representation and character-representation.
       If it is ``None``, then the system's locale setting will be
       used as the default.

       ``newline`` can be ``None``, ``'\n'``, ``'\r'``, or ``'\r\n'``
       (all other values are illegal); it indicates the translation
       for ``'\n'`` characters written. If ``None``, a
       system-specific default is chosen, i.e., ``'\r\n'`` on Windows
       and ``'\n'`` on Unix/Linux. Setting ``newline='\n'`` on input
       means that no CRLF translation is done; lines ending in
       ``'\r\n'`` will be returned as ``'\r\n'``. (``'\r'`` support
       is still needed for some OSX applications that produce files
       using ``'\r'`` line endings; Excel (when exporting to text)
       and Adobe Illustrator EPS files are the most common examples.)

Another implementation, ``StringIO``, creates a file-like ``TextIO``
implementation without an underlying Buffered I/O object. While
similar functionality could be provided by wrapping a ``BytesIO``
object in a ``TextIOWrapper``, the ``StringIO`` object allows for much
greater efficiency as it does not need to actually perform encoding
and decoding. A String I/O object can just store the encoded string
as-is. The ``StringIO`` object's ``__init__`` signature takes an
optional string specifying the initial value; the initial position is
always 0. It does not support encodings or newline translations; you
always read back exactly the characters you wrote.

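As a rough illustration of the text layer described above, the sketch
below uses the ``io`` module of present-day Python 3, which descends
from this proposal. It is an assumption that those classes behave the
way this draft specifies; details may differ. An in-memory ``BytesIO``
object stands in for the buffered object being wrapped.

```python
import io

# A byte-level object stands in for the buffered layer.
byte_stream = io.BytesIO()

# Wrap it in the text layer: str in, encoded bytes out.
# newline="\r\n" asks for "\n" to be translated on output.
text = io.TextIOWrapper(byte_stream, encoding="utf-8", newline="\r\n")
text.write("h\u00e9llo\n")
text.flush()  # push the encoded characters down to the byte layer

encoded = byte_stream.getvalue()
wraps_same_object = text.buffer is byte_stream

# StringIO keeps characters in memory; no encoding step is involved.
memory = io.StringIO("first\nsecond\n")
first_line = memory.readline()
rest = memory.read()
```

Here ``encoded`` holds the UTF-8 bytes with the newline translated on
output, while the ``StringIO`` object hands back exactly the
characters it was given.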

Unicode encoding/decoding Issues
--------------------------------

We should allow passing an error-handling argument whenever an
encoding is accepted, and we should allow changing the error-handling
setting later. The behavior of Text I/O operations in the face of
Unicode problems and ambiguities (e.g. diacritics, surrogates, invalid
bytes in an encoding) should be the same as that of the unicode
``encode()``/``decode()`` methods. ``UnicodeError`` may be raised.

Implementation note: we should be able to reuse much of the
infrastructure provided by the ``codecs`` module. If it doesn't
provide the exact APIs we need, we should refactor it to avoid
reinventing the wheel.

Non-blocking I/O
================

Non-blocking I/O is fully supported on the Raw I/O level only. If a
raw object is in non-blocking mode and an operation would block, then
``.read()`` and ``.readinto()`` return ``None``, while ``.write()``
returns 0. In order to put an object in non-blocking mode, the user
must extract the fileno and do it by hand.

At the Buffered I/O and Text I/O layers, if a read or write fails due
to a non-blocking condition, they raise an ``IOError`` with ``errno``
set to ``EAGAIN``.

Originally, we considered propagating up the Raw I/O behavior, but
many corner cases and problems were raised. To address these issues,
significant changes would need to have been made to the Buffered I/O
and Text I/O layers. For example, what should ``.flush()`` do on a
Buffered non-blocking object? How would the user instruct the object
to "Write as much as you can from your buffer, but don't block"? A
non-blocking ``.flush()`` that doesn't necessarily flush all available
data is counter-intuitive. Since non-blocking and blocking objects
would have such different semantics at these layers, it was agreed to
abandon efforts to combine them into a single type.

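The raw-level read behavior described above can be sketched with the
``io.FileIO`` class of present-day Python 3. This is an illustration
under assumptions, not part of the specification: it assumes a
POSIX-like platform where ``os.set_blocking()`` works on pipes, and
that ``io.FileIO`` follows this draft for reads. Non-blocking writes
are left out because current behavior may differ from the text above.

```python
import io
import os

read_fd, write_fd = os.pipe()
raw = io.FileIO(read_fd, "r")  # raw layer only, no buffering

# The object is put in non-blocking mode "by hand", via its fileno.
os.set_blocking(raw.fileno(), False)

empty_result = raw.read(16)  # nothing written yet: would block

os.write(write_fd, b"ping")
data_result = raw.read(16)   # bytes are available now

os.close(write_fd)
eof_result = raw.read(16)    # writer closed: end of file

raw.close()
```

The three results are distinct: ``None`` means "no bytes available
right now", a non-empty bytes object is data, and an empty bytes
object means end of file.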

The ``open()`` Built-in Function
================================

The ``open()`` built-in function is specified by the following
pseudo-code::

    def open(filename, mode="r", buffering=None, *,
             encoding=None, newline=None):
        assert isinstance(filename, (str, int))
        assert isinstance(mode, str)
        assert buffering is None or isinstance(buffering, int)
        assert encoding is None or isinstance(encoding, str)
        assert newline in (None, "\n", "\r", "\r\n")
        modes = set(mode)
        if modes - set("arwb+t") or len(mode) > len(modes):
            raise ValueError("invalid mode: %r" % mode)
        reading = "r" in modes
        writing = "w" in modes
        binary = "b" in modes
        appending = "a" in modes
        updating = "+" in modes
        text = "t" in modes or not binary
        if text and binary:
            raise ValueError("can't have text and binary mode at once")
        if reading + writing + appending > 1:
            raise ValueError("can't have read/write/append mode at once")
        if not (reading or writing or appending):
            raise ValueError("must have exactly one of read/write/append mode")
        if binary and encoding is not None:
            raise ValueError("binary mode doesn't take an encoding arg")
        if binary and newline is not None:
            raise ValueError("binary mode doesn't take a newline arg")
        # XXX Need to spec the signature for FileIO()
        raw = FileIO(filename, mode)
        if buffering is None:
            buffering = 8*1024  # International standard buffer size
            # XXX Try setting it to fstat().st_blksize
        if buffering < 0:
            raise ValueError("invalid buffering size")
        if buffering == 0:
            if binary:
                return raw
            raise ValueError("can't have unbuffered text I/O")
        if updating:
            buffer = BufferedRandom(raw, buffering)
        elif writing or appending:
            buffer = BufferedWriter(raw, buffering)
        else:
            assert reading
            buffer = BufferedReader(raw, buffering)
        if binary:
            return buffer
        assert text
        return TextIOWrapper(buffer, encoding, newline)

Copyright
=========

This document has been placed in the public domain.

..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: