python-peps/pep-0278.txt

PEP: 278
Title: Universal Newline Support
Version: $Revision$
Last-Modified: $Date$
Author: jack@cwi.nl (Jack Jansen)
Status: Final
Type: Standards Track
Content-Type: text/x-rst
Created: 14-Jan-2002
Python-Version: 2.3
Post-History:


Abstract
========

This PEP discusses a way in which Python can support I/O on files
which have a newline format that is not the native format on the
platform, so that Python on each platform can read and import
files with CR (Macintosh), LF (Unix) or CR LF (Windows) line
endings.

It is more and more common to come across files that have an end
of line that does not match the standard on the current platform:
files downloaded over the net, remotely mounted filesystems on a
different platform, Mac OS X with its double standard of Mac and
Unix line endings, etc.

Many tools such as editors and compilers already handle this
gracefully, it would be good if Python did so too.


Specification
=============

Universal newline support is enabled by default,
but can be disabled during the configure of Python.

In a Python with universal newline support the feature is
automatically enabled for all import statements and ``execfile()``
calls. There is no special support for ``eval()`` or exec.

In a Python with universal newline support ``open()`` the mode
parameter can also be "U", meaning "open for input as a text file
with universal newline interpretation".  Mode "rU" is also allowed,
for symmetry with "rb". Mode "U" cannot be
combined with other mode flags such as "+". Any line ending in the
input file will be seen as a '\n' in Python, so little other code has
to change to handle universal newlines.

Conversion of newlines happens in all calls that read data: ``read()``,
``readline()``, ``readlines()``, etc.

There is no special support for output to file with a different
newline convention, and so mode "wU" is also illegal.

A file object that has been opened in universal newline mode gets
a new attribute "newlines" which reflects the newline convention
used in the file.  The value for this attribute is one of None (no
newline read yet), "\r", "\n", "\r\n" or a tuple containing all the
newline types seen.


Rationale
=========

Universal newline support is implemented in C, not in Python.
This is done because we want files with a foreign newline
convention to be import-able, so a Python Lib directory can be
shared over a remote file system connection, or between MacPython
and Unix-Python on Mac OS X.  For this to be feasible the
universal newline convention needs to have a reasonably small
impact on performance, which means a Python implementation is not
an option as it would bog down all imports. And because of files
with multiple newline conventions, which Visual C++ and other
Windows tools will happily produce, doing a quick check for the
newlines used in a file (handing off the import to C code if a
platform-local newline is seen) will not work.  Finally, a C
implementation also allows tracebacks and such (which open the
Python source module) to be handled easily.

There is no output implementation of universal newlines, Python
programs are expected to handle this by themselves or write files
with platform-local convention otherwise.  The reason for this is
that input is the difficult case, outputting different newlines to
a file is already easy enough in Python.

Also, an output implementation would be much more difficult than an
input implementation, surprisingly: a lot of output is done through
``PyXXX_Print()`` methods, and at this point the file object is not
available anymore, only a ``FILE *``. So, an output implementation would
need to somehow go from the ``FILE*`` to the file object, because that
is where the current newline delimiter is stored.

The input implementation has no such problem: there are no cases in
the Python source tree where files are partially read from C,
partially from Python, and such cases are expected to be rare in
extension modules. If such cases exist the only problem is that the
newlines attribute of the file object is not updated during the
``fread()`` or ``fgets()`` calls that are done direct from C.

A partial output implementation, where strings passed to ``fp.write()``
would be converted to use fp.newlines as their line terminator but
all other output would not is far too surprising, in my view.

Because there is no output support for universal newlines there is
also no support for a mode "rU+": the surprise factor of the
previous paragraph would hold to an even stronger degree.

There is no support for universal newlines in strings passed to
``eval()`` or ``exec``. It is envisioned that such strings always have the
standard ``\n`` line feed, if the strings come from a file that file can
be read with universal newlines.

I think there are no special issues with unicode. utf-16 shouldn't
pose any new problems, as such files need to be opened in binary
mode anyway. Interaction with utf-8 is fine too: values 0x0a and 0x0d
cannot occur as part of a multibyte sequence.

Universal newline files should work fine with iterators and
``xreadlines()`` as these eventually call the normal file
readline/readlines methods.


While universal newlines are automatically enabled for import they
are not for opening, where you have to specifically say open(...,
"U"). This is open to debate, but here are a few reasons for this
design:

- Compatibility.  Programs which already do their own
  interpretation of ``\r\n`` in text files would break. Examples of such
  programs would be editors which warn you when you open a file with
  a different newline convention. If universal newlines was made the
  default such an editor would silently convert your line endings to
  the local convention on save. Programs which open binary files as
  text files on Unix would also break (but it could be argued they
  deserve it :-).

- Interface clarity.  Universal newlines are only supported for
  input files, not for input/output files, as the semantics would
  become muddy.  Would you write Mac newlines if all reads so far
  had encountered Mac newlines?  But what if you then later read a
  Unix newline?

The newlines attribute is included so that programs that really
care about the newline convention, such as text editors, can
examine what was in a file.  They can then save (a copy of) the
file with the same newline convention (or, in case of a file with
mixed newlines, ask the user what to do, or output in platform
convention).

Feedback is explicitly solicited on one item in the reference
implementation: whether or not the universal newlines routines
should grab the global interpreter lock.  Currently they do not,
but this could be considered living dangerously, as they may
modify fields in a ``FileObject``.  But as these routines are
replacements for ``fgets()`` and ``fread()`` as well it may be difficult
to decide whether or not the lock is held when the routine is
called.  Moreover, the only danger is that if two threads read the
same ``FileObject`` at the same time an extraneous newline may be seen
or the "newlines" attribute may inadvertently be set to mixed.  I
would argue that if you read the same ``FileObject`` in two threads
simultaneously you are asking for trouble anyway.

Note that no globally accessible pointers are manipulated in the
``fgets()`` or ``fread()`` replacement routines, just some integer-valued
flags, so the chances of core dumps are zero (he said:-).

Universal newline support can be disabled during configure because it does
have a small performance penalty, and moreover the implementation has
not been tested on all conceivable platforms yet. It might also be silly
on some platforms (WinCE or Palm devices, for instance). If universal
newline support is not enabled then file objects do not have the "newlines"
attribute, so testing whether the current Python has it can be done with a
simple::

   if hasattr(open, 'newlines'):
      print 'We have universal newline support'

Note that this test uses the ``open()`` function rather than the file
type so that it won't fail for versions of Python where the file
type was not available (the file type was added to the built-in
namespace in the same release as the universal newline feature was
added).

Additionally, note that this test fails again on Python versions
>= 2.5, when ``open()`` was made a function again and is not synonymous
with the file type anymore.


Reference Implementation
========================

A reference implementation is available in SourceForge patch
#476814: http://www.python.org/sf/476814


References
==========

None.


Copyright
=========

This document has been placed in the public domain.


..
  Local Variables:
  mode: indented-text
  indent-tabs-mode: nil
  fill-column: 70
  End:
Added PEP 278, Universal Newline Support, Jack Jansen 2002-01-23 08:24:26 -05:00			`PEP: 278`
			`Title: Universal Newline Support`
			`Version: $Revision$`
			`Last-Modified: $Date$`
			`Author: jack@cwi.nl (Jack Jansen)`
Jim Jewett points out that PEPs 278, 292, 318, and 324 can be marked as Final, and that 215 can be marked as Rejected. 2005-01-29 13:24:59 -05:00			`Status: Final`
Added PEP 278, Universal Newline Support, Jack Jansen 2002-01-23 08:24:26 -05:00			`Type: Standards Track`
Convert 12 PEPs (#187) 2017-01-24 15:47:22 -05:00			`Content-Type: text/x-rst`
Added PEP 278, Universal Newline Support, Jack Jansen 2002-01-23 08:24:26 -05:00			`Created: 14-Jan-2002`
			`Python-Version: 2.3`
			`Post-History:`


			`Abstract`
Convert 12 PEPs (#187) 2017-01-24 15:47:22 -05:00			`========`
Added PEP 278, Universal Newline Support, Jack Jansen 2002-01-23 08:24:26 -05:00
Convert 12 PEPs (#187) 2017-01-24 15:47:22 -05:00			`This PEP discusses a way in which Python can support I/O on files`
			`which have a newline format that is not the native format on the`
			`platform, so that Python on each platform can read and import`
			`files with CR (Macintosh), LF (Unix) or CR LF (Windows) line`
			`endings.`
Added PEP 278, Universal Newline Support, Jack Jansen 2002-01-23 08:24:26 -05:00
Convert 12 PEPs (#187) 2017-01-24 15:47:22 -05:00			`It is more and more common to come across files that have an end`
			`of line that does not match the standard on the current platform:`
			`files downloaded over the net, remotely mounted filesystems on a`
			`different platform, Mac OS X with its double standard of Mac and`
			`Unix line endings, etc.`

			`Many tools such as editors and compilers already handle this`
			`gracefully, it would be good if Python did so too.`
Added PEP 278, Universal Newline Support, Jack Jansen 2002-01-23 08:24:26 -05:00

			`Specification`
Convert 12 PEPs (#187) 2017-01-24 15:47:22 -05:00			`=============`

			`Universal newline support is enabled by default,`
			`but can be disabled during the configure of Python.`

			`In a Python with universal newline support the feature is`
			automatically enabled for all import statements and ``execfile()``
			calls. There is no special support for ``eval()`` or exec.

			In a Python with universal newline support ``open()`` the mode
			`parameter can also be "U", meaning "open for input as a text file`
			`with universal newline interpretation". Mode "rU" is also allowed,`
			`for symmetry with "rb". Mode "U" cannot be`
			`combined with other mode flags such as "+". Any line ending in the`
			`input file will be seen as a '\n' in Python, so little other code has`
			`to change to handle universal newlines.`

			Conversion of newlines happens in all calls that read data: ``read()``,
			``readline()``, ``readlines()``, etc.

			`There is no special support for output to file with a different`
			`newline convention, and so mode "wU" is also illegal.`

			`A file object that has been opened in universal newline mode gets`
			`a new attribute "newlines" which reflects the newline convention`
			`used in the file. The value for this attribute is one of None (no`
			`newline read yet), "\r", "\n", "\r\n" or a tuple containing all the`
			`newline types seen.`

Added PEP 278, Universal Newline Support, Jack Jansen 2002-01-23 08:24:26 -05:00
			`Rationale`
Convert 12 PEPs (#187) 2017-01-24 15:47:22 -05:00			`=========`

			`Universal newline support is implemented in C, not in Python.`
			`This is done because we want files with a foreign newline`
			`convention to be import-able, so a Python Lib directory can be`
			`shared over a remote file system connection, or between MacPython`
			`and Unix-Python on Mac OS X. For this to be feasible the`
			`universal newline convention needs to have a reasonably small`
			`impact on performance, which means a Python implementation is not`
			`an option as it would bog down all imports. And because of files`
			`with multiple newline conventions, which Visual C++ and other`
			`Windows tools will happily produce, doing a quick check for the`
			`newlines used in a file (handing off the import to C code if a`
			`platform-local newline is seen) will not work. Finally, a C`
			`implementation also allows tracebacks and such (which open the`
			`Python source module) to be handled easily.`

			`There is no output implementation of universal newlines, Python`
			`programs are expected to handle this by themselves or write files`
			`with platform-local convention otherwise. The reason for this is`
			`that input is the difficult case, outputting different newlines to`
			`a file is already easy enough in Python.`

			`Also, an output implementation would be much more difficult than an`
			`input implementation, surprisingly: a lot of output is done through`
			``PyXXX_Print()`` methods, and at this point the file object is not
			available anymore, only a ``FILE *``. So, an output implementation would
			need to somehow go from the ``FILE*`` to the file object, because that
			`is where the current newline delimiter is stored.`

			`The input implementation has no such problem: there are no cases in`
			`the Python source tree where files are partially read from C,`
			`partially from Python, and such cases are expected to be rare in`
			`extension modules. If such cases exist the only problem is that the`
			`newlines attribute of the file object is not updated during the`
			``fread()`` or ``fgets()`` calls that are done direct from C.

			A partial output implementation, where strings passed to ``fp.write()``
			`would be converted to use fp.newlines as their line terminator but`
			`all other output would not is far too surprising, in my view.`

			`Because there is no output support for universal newlines there is`
			`also no support for a mode "rU+": the surprise factor of the`
			`previous paragraph would hold to an even stronger degree.`

			`There is no support for universal newlines in strings passed to`
			``eval()`` or ``exec``. It is envisioned that such strings always have the
			standard ``\n`` line feed, if the strings come from a file that file can
			`be read with universal newlines.`

			`I think there are no special issues with unicode. utf-16 shouldn't`
			`pose any new problems, as such files need to be opened in binary`
			`mode anyway. Interaction with utf-8 is fine too: values 0x0a and 0x0d`
			`cannot occur as part of a multibyte sequence.`

			`Universal newline files should work fine with iterators and`
			``xreadlines()`` as these eventually call the normal file
			`readline/readlines methods.`


			`While universal newlines are automatically enabled for import they`
			`are not for opening, where you have to specifically say open(...,`
			`"U"). This is open to debate, but here are a few reasons for this`
			`design:`

			`- Compatibility. Programs which already do their own`
			interpretation of ``\r\n`` in text files would break. Examples of such
			`programs would be editors which warn you when you open a file with`
			`a different newline convention. If universal newlines was made the`
			`default such an editor would silently convert your line endings to`
			`the local convention on save. Programs which open binary files as`
			`text files on Unix would also break (but it could be argued they`
			`deserve it :-).`

			`- Interface clarity. Universal newlines are only supported for`
			`input files, not for input/output files, as the semantics would`
			`become muddy. Would you write Mac newlines if all reads so far`
			`had encountered Mac newlines? But what if you then later read a`
			`Unix newline?`

			`The newlines attribute is included so that programs that really`
			`care about the newline convention, such as text editors, can`
			`examine what was in a file. They can then save (a copy of) the`
			`file with the same newline convention (or, in case of a file with`
			`mixed newlines, ask the user what to do, or output in platform`
			`convention).`

			`Feedback is explicitly solicited on one item in the reference`
			`implementation: whether or not the universal newlines routines`
			`should grab the global interpreter lock. Currently they do not,`
			`but this could be considered living dangerously, as they may`
			modify fields in a ``FileObject``. But as these routines are
			replacements for ``fgets()`` and ``fread()`` as well it may be difficult
			`to decide whether or not the lock is held when the routine is`
			`called. Moreover, the only danger is that if two threads read the`
			same ``FileObject`` at the same time an extraneous newline may be seen
			`or the "newlines" attribute may inadvertently be set to mixed. I`
			would argue that if you read the same ``FileObject`` in two threads
			`simultaneously you are asking for trouble anyway.`

			`Note that no globally accessible pointers are manipulated in the`
			``fgets()`` or ``fread()`` replacement routines, just some integer-valued
			`flags, so the chances of core dumps are zero (he said:-).`

			`Universal newline support can be disabled during configure because it does`
			`have a small performance penalty, and moreover the implementation has`
			`not been tested on all conceivable platforms yet. It might also be silly`
			`on some platforms (WinCE or Palm devices, for instance). If universal`
			`newline support is not enabled then file objects do not have the "newlines"`
			`attribute, so testing whether the current Python has it can be done with a`
			`simple::`

			`if hasattr(open, 'newlines'):`
			`print 'We have universal newline support'`

			Note that this test uses the ``open()`` function rather than the file
			`type so that it won't fail for versions of Python where the file`
			`type was not available (the file type was added to the built-in`
			`namespace in the same release as the universal newline feature was`
			`added).`

			`Additionally, note that this test fails again on Python versions`
			>= 2.5, when ``open()`` was made a function again and is not synonymous
			`with the file type anymore.`

Added PEP 278, Universal Newline Support, Jack Jansen 2002-01-23 08:24:26 -05:00
			`Reference Implementation`
Convert 12 PEPs (#187) 2017-01-24 15:47:22 -05:00			`========================`
Added PEP 278, Universal Newline Support, Jack Jansen 2002-01-23 08:24:26 -05:00
Convert 12 PEPs (#187) 2017-01-24 15:47:22 -05:00			`A reference implementation is available in SourceForge patch`
			`#476814: http://www.python.org/sf/476814`
Added PEP 278, Universal Newline Support, Jack Jansen 2002-01-23 08:24:26 -05:00

			`References`
Convert 12 PEPs (#187) 2017-01-24 15:47:22 -05:00			`==========`
Added PEP 278, Universal Newline Support, Jack Jansen 2002-01-23 08:24:26 -05:00
Convert 12 PEPs (#187) 2017-01-24 15:47:22 -05:00			`None.`
Added PEP 278, Universal Newline Support, Jack Jansen 2002-01-23 08:24:26 -05:00

			`Copyright`
Convert 12 PEPs (#187) 2017-01-24 15:47:22 -05:00			`=========`

			`This document has been placed in the public domain.`
Added PEP 278, Universal Newline Support, Jack Jansen 2002-01-23 08:24:26 -05:00


Convert 12 PEPs (#187) 2017-01-24 15:47:22 -05:00			`..`
			`Local Variables:`
			`mode: indented-text`
			`indent-tabs-mode: nil`
			`fill-column: 70`
			`End:`