PEP: 278
Title: Universal Newline Support
Version: $Revision$
Last-Modified: $Date$
Author: jack@cwi.nl (Jack Jansen)
Status: Draft
Type: Standards Track
Created: 14-Jan-2002
Python-Version: 2.3
Post-History:


Abstract

    This PEP discusses a way in which Python can support I/O on files
    which have a newline format that is not the native format on the
    platform, so that Python on each platform can read and import
    files with CR (Macintosh), LF (Unix) or CR LF (Windows) line
    endings.

    It is more and more common to come across files that have an end
    of line that does not match the standard on the current platform:
    files downloaded over the net, remotely mounted filesystems on a
    different platform, Mac OS X with its double standard of Mac and
    Unix line endings, etc.

    Many tools such as editors and compilers already handle this
    gracefully; it would be good if Python did so too.


Specification

    Universal newline support needs to be enabled during the configure
    of Python.

    In a Python with universal newline support the feature is
    automatically enabled for all import statements and source()
    calls.

    In a Python with universal newline support, the mode parameter of
    open() can also be "t", meaning "open for input as a text file
    with universal newline interpretation".  Mode "t" cannot be
    combined with other mode flags such as "+".  Any line ending in
    the input file will be seen as a '\n' in Python, so little other
    code has to change to handle universal newlines.

    There is no special support for output to a file with a different
    newline convention.

    A file object that has been opened in universal newline mode gets
    a new attribute "newlines" which reflects the newline convention
    used in the file.  The value for this attribute is one of None (no
    newline read yet), "\r", "\n", "\r\n" or "mixed" (multiple
    different types of newlines seen).


Rationale

    Universal newline support is implemented in C, not in Python.
    This is done because we want files with a foreign newline
    convention to be import-able, so a Python Lib directory can be
    shared over a remote file system connection, or between MacPython
    and Unix-Python on Mac OS X.  For this to be feasible the
    universal newline convention needs to have a reasonably small
    impact on performance, which means a Python implementation is not
    an option as it would bog down all imports.  And because of files
    with multiple newline conventions, which Visual C++ and other
    Windows tools will happily produce, doing a quick check for the
    newlines used in a file (handing off the import to C code if a
    platform-local newline is seen) will not work.  Finally, a C
    implementation also allows tracebacks and such (which open the
    Python source module) to be handled easily.

    Universal newline support is implemented (for this release) as a
    compile time option because there is a performance penalty, even
    though it should be a small one.

    There is no output implementation of universal newlines; Python
    programs are expected to handle this by themselves or otherwise
    write files with the platform-local convention.  The reason for
    this is that input is the difficult case; outputting different
    newlines to a file is already easy enough in Python.  An output
    implementation would also slow down all "normal" Python output,
    even if only a little.

    While universal newlines are automatically enabled for import they
    are not for opening, where you have to specifically say open(...,
    "t").  This is open to debate, but here are a few reasons for this
    design:

    - Compatibility.  Programs which already do their own
      interpretation of \r\n in text files would break.  Programs
      which open binary files as text files on Unix would also break
      (but it could be argued they deserve it :-).

    - Interface clarity.  Universal newlines are only supported for
      input files, not for input/output files, as the semantics would
      become muddy.  Would you write Mac newlines if all reads so far
      had encountered Mac newlines?
      But what if you then later read a Unix newline?

    The newlines attribute is included so that programs that really
    care about the newline convention, such as text editors, can
    examine what was in a file.  They can then save (a copy of) the
    file with the same newline convention (or, in case of a file with
    mixed newlines, ask the user what to do, or output in platform
    convention).

    Feedback is explicitly solicited on one item in the reference
    implementation: whether or not the universal newlines routines
    should grab the global interpreter lock.  Currently they do not,
    but this could be considered living dangerously, as they may
    modify fields in a FileObject.  But as these routines are
    replacements for fgets() and fread() as well, it may be difficult
    to decide whether or not the lock is held when the routine is
    called.  Moreover, the only danger is that if two threads read the
    same FileObject at the same time an extraneous newline may be seen
    or the "newlines" attribute may inadvertently be set to mixed.  I
    would argue that if you read the same FileObject in two threads
    simultaneously you are asking for trouble anyway.


Reference Implementation

    A reference implementation is available in SourceForge patch
    #476814.


References

    None.


Copyright

    This document has been placed in the public domain.



Local Variables:
mode: indented-text
indent-tabs-mode: nil
fill-column: 70
End:
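
The input semantics described under Specification (every CR, LF or CR LF is seen as '\n', and a "newlines" value of None, "\r", "\n", "\r\n" or "mixed" is recorded) can be illustrated with a minimal pure-Python sketch. This is only an emulation of the described behaviour on an in-memory string, not the C reference implementation; the name translate_newlines is hypothetical and appears nowhere in the PEP or the patch.

```python
import re

# Hypothetical helper for illustration only: it emulates the semantics the
# PEP describes and is not taken from the reference patch.
# "\r\n" is listed first so a CR LF pair is matched as one line ending.
_NEWLINE = re.compile(r"\r\n|\r|\n")


def translate_newlines(data):
    # Collect the distinct line-ending styles present in the raw data.
    seen = set(_NEWLINE.findall(data))
    if not seen:
        newlines = None          # no newline read yet
    elif len(seen) == 1:
        newlines = seen.pop()    # exactly one convention was used
    else:
        newlines = "mixed"       # several different conventions were seen
    # Every line ending is presented to the program as "\n".
    return _NEWLINE.sub("\n", data), newlines
```

A program such as a text editor could use the second returned value to decide which convention to write back, falling back to the platform convention or asking the user when it is "mixed".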