Added PEP 305, CSV file API
parent 7b1b08a413
commit 935a1711b3
@@ -105,7 +105,8 @@ Index by Category
  S   301  Package Index and Metadata for Distutils    Jones
  S   302  New Import Hooks                            JvR
  S   303  Extend divmod() for Multiple Divisors       Bellman
- S   304  Controlling generation of bytecode files    Montanaro
+ S   304  Controlling Generation of Bytecode Files    Montanaro
+ I   305  CSV File API                                Montanaro, Altis, Wells

  Finished PEPs (done, implemented in CVS)

@@ -299,7 +300,8 @@ Numerical Index
  S   301  Package Index and Metadata for Distutils    Jones
  S   302  New Import Hooks                            JvR
  S   303  Extend divmod() for Multiple Divisors       Bellman
- S   304  Controlling generation of bytecode files    Montanaro
+ S   304  Controlling Generation of Bytecode Files    Montanaro
+ I   305  CSV File API                                Montanaro, Altis, Wells
  SR  666  Reject Foolish Indentation                  Creighton


@@ -320,6 +322,7 @@ Owners
  Aahz                    aahz@pobox.com
  Ahlstrom, James C.      jim@interet.com
  Althoff, Jim            james_althoff@i2.com
+ Altis, Kevin            altis@semi-retired.com
  Ascher, David           davida@activestate.com
  Barrett, Paul           barrett@stsci.edu
  Baxter, Anthony         anthony@interlink.com.au
@@ -372,6 +375,7 @@ Owners
  Stein, Greg             gstein@lyra.org
  Tirosh, Oren            oren at hishome.net
  Warsaw, Barry           barry@zope.com
+ Wells, Cliff            LogiplexSoftware@earthlink.net
  Wilson, Greg            gvwilson@ddj.com
  Wouters, Thomas         thomas@xs4all.net
  Yee, Ka-Ping            ping@lfw.org
@@ -0,0 +1,230 @@
PEP: 305
Title: CSV file API
Version: $Revision$
Last-Modified: $Date$
Author: Skip Montanaro <skip@pobox.com>,
        Kevin Altis <altis@semi-retired.com>,
        Cliff Wells <LogiplexSoftware@earthlink.net>
Status: Draft
Type: Informational
Content-Type: text/x-rst
Created: 26-Jan-2003
Post-History:


Abstract
========

The Comma Separated Values (CSV) file format is the most common import
and export format for spreadsheets and databases. Although many CSV
files are simple to parse, the format is not formally defined by a
stable specification and is subtle enough that parsing lines of a CSV
file with something like ``line.split(",")`` is bound to fail. This
PEP defines an API for reading and writing CSV files which should make
it possible for programmers to select a CSV module which meets their
requirements.
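
As a rough illustration of why ``line.split(",")`` falls short, the sketch below compares it with a quote-aware parser on one record. It assumes the ``csv`` module found in today's Python standard library, not the draft API described in this PEP, and the sample record is invented for the example.

```python
import csv
import io

# One record whose second field contains a comma and doubled quote characters.
line = '1,"Smith, ""Skip""",Chicago'

# Naive splitting cuts the quoted field in two and keeps the quote characters.
naive = line.split(",")

# A quote-aware parser treats the quoted text as a single field.
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive), naive)
print(len(parsed), parsed)
```

The naive version reports four fields where the record holds three, and the quote characters stay embedded in the data.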


Existing Modules
================

Three widely available modules enable programmers to read and write
CSV files:

- Object Craft's CSV module [1]_

- Cliff Wells's Python-DSV module [2]_

- Laurence Tratt's ASV module [3]_

Each has a different API, making it somewhat difficult for programmers
to switch between them. More of a problem may be that they interpret
some of the CSV corner cases differently, so even after surmounting
the differences in the module APIs, the programmer has to also deal
with semantic differences between the packages.


Rationale
=========

By defining common APIs for reading and writing CSV files, we make it
easier for programmers to choose an appropriate module to suit their
needs, and make it easier to switch between modules if their needs
change. This PEP also forms a set of requirements for creation of a
module which will hopefully be incorporated into the Python
distribution.


Module Interface
================

The module supports two basic APIs, one for reading and one for
writing. The reading interface is::

    reader(fileobj [, dialect='excel2000']
                   [, quotechar='"']
                   [, delimiter=',']
                   [, skipinitialspace=False])

A reader object is an iterable which takes a file-like object opened
for reading as the sole required parameter. It also accepts four
optional parameters (discussed below). Readers are typically used as
follows::

    csvreader = csv.reader(file("some.csv"))
    for row in csvreader:
        process(row)
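
The reading loop above can be tried without a file on disk. This is a minimal sketch that assumes the ``csv`` module in the current standard library, whose ``reader`` follows the same iteration pattern; the in-memory ``io.StringIO`` object and the sample data are stand-ins chosen for the example.

```python
import csv
import io

# Stand-in for file("some.csv"): any iterable that yields lines of text works.
fileobj = io.StringIO('name,city\n"Montanaro, Skip",Chicago\n')

rows = []
for row in csv.reader(fileobj):
    # Each row arrives as a list of strings, one entry per field.
    rows.append(row)

print(rows)
```

With a real file, the current documentation recommends opening it with ``newline=""`` so that the reader sees line endings unchanged.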

The writing interface is similar::

    writer(fileobj [, dialect='excel2000']
                   [, quotechar='"']
                   [, delimiter=',']
                   [, skipinitialspace=False])

A writer object is a wrapper around a file-like object opened for
writing. It accepts the same four optional parameters as the reader
constructor. Writers are typically used as follows::

    csvwriter = csv.writer(file("some.csv", "w"))
    for row in someiterable:
        csvwriter.write(row)
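
A comparable sketch for writing, again assuming the ``csv`` module in the current standard library rather than this draft. One difference to note: the draft spells the method ``write(row)``, while the module that shipped names it ``writerow(row)``. The sample rows are invented for the example.

```python
import csv
import io

rows = [["name", "quote"], ["Wells, Cliff", 'He said "hi"']]

buf = io.StringIO()          # stand-in for file("some.csv", "w")
csvwriter = csv.writer(buf)  # the default dialect is named "excel"
for row in rows:
    csvwriter.writerow(row)  # the draft above calls this write(row)

output = buf.getvalue()
print(output)
```

Fields containing the delimiter or the quote character are quoted automatically, and embedded quote characters are doubled.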


Optional Parameters
-------------------

Both the reader and writer constructors take four optional keyword
parameters:

- dialect is an easy way of specifying a complete set of format
  constraints for a reader or writer. Most people will know what
  application generated a CSV file or what application will process
  the CSV file they are generating, but not the precise settings
  necessary. The only dialect defined initially is "excel2000". The
  dialect parameter is interpreted in a case-insensitive manner.

- quotechar specifies a one-character string to use as the quoting
  character. It defaults to '"'.

- delimiter specifies a one-character string to use as the field
  separator. It defaults to ','.

- skipinitialspace specifies how to interpret whitespace which
  immediately follows a delimiter. It defaults to False, which means
  that whitespace immediately following a delimiter is part of the
  following field.

When processing a dialect setting and one or more of the other
optional parameters, the dialect parameter is processed first, then
the others are processed. This makes it easy to choose a dialect,
then override one or more of the settings. For example, if a CSV file
was generated by Excel 2000 using single quotes as the quote
character and TAB as the delimiter, you could create a reader like::

    csvreader = csv.reader(file("some.csv"), dialect="excel2000",
                           quotechar="'", delimiter='\t')

Other details of how Excel generates CSV files would be handled
automatically.
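
The dialect-then-override behaviour can be sketched with the ``csv`` module in the current standard library. Note the assumption: that module names its Excel dialect ``"excel"``, not the ``"excel2000"`` proposed in this draft. The sample data is invented for the example.

```python
import csv
import io

# Tab-delimited data that uses single quotes as the quote character.
data = "id\t'first\tsecond'\t'it''s'\n"

# Start from the "excel" dialect, then override two of its settings.
reader = csv.reader(io.StringIO(data), dialect="excel",
                    quotechar="'", delimiter="\t")
row = next(reader)

# skipinitialspace controls whitespace that directly follows a delimiter.
kept = next(csv.reader(io.StringIO("a, b, c\n")))
skipped = next(csv.reader(io.StringIO("a, b, c\n"), skipinitialspace=True))

print(row)
print(kept, skipped)
```

Settings not overridden, such as doubling the quote character to escape it, still come from the chosen dialect.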


Testing
=======

TBD.


Issues
======

- Should a parameter control how consecutive delimiters are
  interpreted? Our thought is "no". Consecutive delimiters should
  always denote an empty field.

- What about Unicode? Is it sufficient to pass a file object obtained
  from codecs.open()? For example::

      csvreader = csv.reader(codecs.open("some.csv", "r", "cp1252"))

      csvwriter = csv.writer(codecs.open("some.csv", "w", "utf-8"))

  In the first example, text would be assumed to be encoded as cp1252.
  Should the system be aggressive in converting to Unicode or should
  Unicode strings only be returned if necessary?

  In the second example, the file will take care of automatically
  encoding Unicode strings as utf-8 before writing to disk.

- What about alternate escape conventions? When Excel exports a file,
  it appears only the field delimiter needs to be escaped. It
  accomplishes this by quoting the entire field, then doubling any
  quote characters which appear in the field. It also quotes a field
  if the first character is a quote character. It would seem we need
  to support two modes: escape-by-quoting and escape-by-prefix. In
  addition, for the second mode, we'd have to specify the escape
  character (presumably defaulting to a backslash character).

- Should there be a "fully quoted" mode for writing? What about
  "fully quoted except for numeric values"?

- What about end-of-line? If I generate a CSV file on a Unix system,
  will Excel properly recognize the LF-only line terminators?

- What about conversion to other file formats? Is the list-of-lists
  output from the csvreader sufficient to feed into other writers?

- What about an option to generate list-of-dicts output from the
  reader and have the writer accept list-of-dicts? This makes
  manipulating individual rows easier since each one is independent,
  but you lose field order when writing and have to tell the writer
  object the order in which the fields should appear in the file.

- Are the quote character and delimiter limited to single characters?
  I had a client not that long ago who wrote their own flat file
  format with a delimiter of ":::".

- How should rows of different lengths be handled? The options seem
  to be:

  * raise an exception when a row is encountered whose length differs
    from the previous row

  * silently return short rows

  * allow the caller to specify the desired row length and what to do
    when rows of a different length are encountered: ignore, truncate,
    pad, raise exception, etc.
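
Several of the questions above can be explored with the ``csv`` module in the current standard library. The sketch below is only an illustration of how that module happens to behave, not an answer prescribed by this draft; the helper names ``dump`` and ``parse`` are hypothetical, and the sample data is invented.

```python
import csv
import io


def dump(rows, **fmt):
    """Write rows with csv.writer and return the text produced."""
    buf = io.StringIO()
    csv.writer(buf, **fmt).writerows(rows)
    return buf.getvalue()


def parse(text, **fmt):
    """Read text with csv.reader and return a list of rows."""
    return list(csv.reader(io.StringIO(text), **fmt))


# Consecutive delimiters yield an empty field; short rows are returned as-is.
ragged = parse("a,,c\nd,e\n")

# Escape-by-quoting (the default) versus escape-by-prefix with a backslash.
tricky = [['say "hi"', "x,y"]]
by_quoting = dump(tricky)
by_prefix = dump(tricky, quoting=csv.QUOTE_NONE, escapechar="\\")

# "Fully quoted" and "fully quoted except for numeric values".
quote_all = dump([["a", 1]], quoting=csv.QUOTE_ALL)
quote_text = dump([["a", 1]], quoting=csv.QUOTE_NONNUMERIC)

# LF-only line endings instead of the default CR LF.
unix_eol = dump([["a", "b"]], lineterminator="\n")

# Dict-based rows: the caller states the field order when writing.
buf = io.StringIO()
dict_writer = csv.DictWriter(buf, fieldnames=["name", "city"])
dict_writer.writeheader()
dict_writer.writerow({"city": "Chicago", "name": "Skip"})
records = list(csv.DictReader(io.StringIO(buf.getvalue())))
```

In that module the delimiter and quote character are each limited to a single character, so a delimiter such as ":::" would need preprocessing or a different tool.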


References
==========

.. [1] csv module, Object Craft
   (http://www.object-craft.com.au/projects/csv)

.. [2] Python-DSV module, Wells
   (http://sourceforge.net/projects/python-dsv/)

.. [3] ASV module, Tratt
   (http://tratt.net/laurie/python/asv/)

There are many references to other CSV-related projects on the Web. A
few are included here.


Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   End: