From 935a1711b369b4cdd7a8d573337388b140f3f9e3 Mon Sep 17 00:00:00 2001
From: David Goodger
Date: Wed, 29 Jan 2003 04:20:19 +0000
Subject: [PATCH] Added PEP 305, CSV file API

---
 pep-0000.txt | 8 +-
 pep-0305.txt | 230 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 236 insertions(+), 2 deletions(-)
 create mode 100644 pep-0305.txt

diff --git a/pep-0000.txt b/pep-0000.txt
index 30f970ca9..246651bbf 100644
--- a/pep-0000.txt
+++ b/pep-0000.txt
@@ -105,7 +105,8 @@ Index by Category
  S 301 Package Index and Metadata for Distutils Jones
  S 302 New Import Hooks JvR
  S 303 Extend divmod() for Multiple Divisors Bellman
- S 304 Controlling generation of bytecode files Montanaro
+ S 304 Controlling Generation of Bytecode Files Montanaro
+ I 305 CSV File API Montanaro, Altis, Wells

  Finished PEPs (done, implemented in CVS)

@@ -299,7 +300,8 @@ Numerical Index
  S 301 Package Index and Metadata for Distutils Jones
  S 302 New Import Hooks JvR
  S 303 Extend divmod() for Multiple Divisors Bellman
- S 304 Controlling generation of bytecode files Montanaro
+ S 304 Controlling Generation of Bytecode Files Montanaro
+ I 305 CSV File API Montanaro, Altis, Wells
  SR 666 Reject Foolish Indentation Creighton


@@ -320,6 +322,7 @@ Owners
  Aahz aahz@pobox.com
  Ahlstrom, James C.
jim@interet.com
  Althoff, Jim james_althoff@i2.com
+ Altis, Kevin altis@semi-retired.com
  Ascher, David davida@activestate.com
  Barrett, Paul barrett@stsci.edu
  Baxter, Anthony anthony@interlink.com.au
@@ -372,6 +375,7 @@ Owners
  Stein, Greg gstein@lyra.org
  Tirosh, Oren oren at hishome.net
  Warsaw, Barry barry@zope.com
+ Wells, Cliff LogiplexSoftware@earthlink.net
  Wilson, Greg gvwilson@ddj.com
  Wouters, Thomas thomas@xs4all.net
  Yee, Ka-Ping ping@lfw.org
diff --git a/pep-0305.txt b/pep-0305.txt
new file mode 100644
index 000000000..3f90443d5
--- /dev/null
+++ b/pep-0305.txt
@@ -0,0 +1,230 @@
+PEP: 305
+Title: CSV file API
+Version: $Revision$
+Last-Modified: $Date$
+Author: Skip Montanaro,
+        Kevin Altis,
+        Cliff Wells
+Status: Draft
+Type: Informational
+Content-Type: text/x-rst
+Created: 26-Jan-2003
+Post-History:
+
+
+Abstract
+========
+
+The Comma Separated Values (CSV) file format is the most common import
+and export format for spreadsheets and databases. Although many CSV
+files are simple to parse, the format is not formally defined by a
+stable specification and is subtle enough that parsing lines of a CSV
+file with something like ``line.split(",")`` is bound to fail. This
+PEP defines an API for reading and writing CSV files which should make
+it possible for programmers to select a CSV module which meets their
+requirements.
+
+
+Existing Modules
+================
+
+Three widely available modules enable programmers to read and write
+CSV files:
+
+- Object Craft's CSV module [1]_
+
+- Cliff Wells's Python-DSV module [2]_
+
+- Laurence Tratt's ASV module [3]_
+
+Each has a different API, making it somewhat difficult for programmers
+to switch between them. More of a problem may be that they interpret
+some of the CSV corner cases differently, so even after surmounting
+the differences in the module APIs, the programmer has to also deal
+with semantic differences between the packages.
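As a rough illustration of the Abstract's claim that ``line.split(",")`` is bound to fail, the sketch below compares a naive split with a quote-aware parse. It assumes the ``csv`` module shipped in present-day Python standard libraries, whose interface differs in detail from the draft API in this PEP, and the sample line is invented for the example.

```python
import csv
import io

# Hypothetical sample row: the second field contains a comma and the
# third contains doubled quote characters (Excel-style escaping).
line = '1,"Smith, John","said ""hi"""\n'

# Naive approach: splits inside the quoted field and keeps the quotes.
naive = line.rstrip("\n").split(",")

# Quote-aware approach: the reader treats the quoted text as one field
# and collapses each doubled quote into a single quote character.
parsed = next(csv.reader(io.StringIO(line)))

print(naive)
print(parsed)
```

The naive split yields more pieces than the row has fields, which is the kind of corner case the existing modules handle in differing ways.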
+
+
+Rationale
+=========
+
+By defining common APIs for reading and writing CSV files, we make it
+easier for programmers to choose an appropriate module to suit their
+needs, and make it easier to switch between modules if their needs
+change. This PEP also forms a set of requirements for creation of a
+module which will hopefully be incorporated into the Python
+distribution.
+
+
+Module Interface
+================
+
+The module supports two basic APIs, one for reading and one for
+writing. The reading interface is::
+
+    reader(fileobj [, dialect='excel2000']
+                   [, quotechar='"']
+                   [, delimiter=',']
+                   [, skipinitialspace=False])
+
+A reader object is an iterable which takes a file-like object opened
+for reading as the sole required parameter. It also accepts four
+optional parameters (discussed below). Readers are typically used as
+follows::
+
+    csvreader = csv.reader(file("some.csv"))
+    for row in csvreader:
+        process(row)
+
+The writing interface is similar::
+
+    writer(fileobj [, dialect='excel2000']
+                   [, quotechar='"']
+                   [, delimiter=',']
+                   [, skipinitialspace=False])
+
+A writer object is a wrapper around a file-like object opened for
+writing. It accepts the same four optional parameters as the reader
+constructor. Writers are typically used as follows::
+
+    csvwriter = csv.writer(file("some.csv", "w"))
+    for row in someiterable:
+        csvwriter.write(row)
+
+
+Optional Parameters
+-------------------
+
+Both the reader and writer constructors take four optional keyword
+parameters:
+
+- dialect is an easy way of specifying a complete set of format
+  constraints for a reader or writer. Most people will know what
+  application generated a CSV file or what application will process
+  the CSV file they are generating, but not the precise settings
+  necessary. The only dialect defined initially is "excel2000". The
+  dialect parameter is interpreted in a case-insensitive manner.
+
+- quotechar specifies a one-character string to use as the quoting
+  character.
It defaults to '"'.
+
+- delimiter specifies a one-character string to use as the field
+  separator. It defaults to ','.
+
+- skipinitialspace specifies how to interpret whitespace which
+  immediately follows a delimiter. It defaults to False, which means
+  that whitespace immediately following a delimiter is part of the
+  following field.
+
+When processing a dialect setting and one or more of the other
+optional parameters, the dialect parameter is processed first, then
+the others are processed. This makes it easy to choose a dialect,
+then override one or more of the settings. For example, if a CSV file
+was generated by Excel 2000 using single quotes as the quote
+character and TAB as the delimiter, you could create a reader like::
+
+    csvreader = csv.reader(file("some.csv"), dialect="excel2000",
+                           quotechar="'", delimiter='\t')
+
+Other details of how Excel generates CSV files would be handled
+automatically.
+
+
+Testing
+=======
+
+TBD.
+
+
+
+Issues
+======
+
+- Should a parameter control how consecutive delimiters are
+  interpreted? Our thought is "no". Consecutive delimiters should
+  always denote an empty field.
+
+- What about Unicode? Is it sufficient to pass a file object gotten
+  from codecs.open()? For example::
+
+      csvreader = csv.reader(codecs.open("some.csv", "r", "cp1252"))
+
+      csvwriter = csv.writer(codecs.open("some.csv", "w", "utf-8"))
+
+  In the first example, text would be assumed to be encoded as cp1252.
+  Should the system be aggressive in converting to Unicode or should
+  Unicode strings only be returned if necessary?
+
+  In the second example, the file will take care of automatically
+  encoding Unicode strings as utf-8 before writing to disk.
+
+- What about alternate escape conventions? When Excel exports a file,
+  it appears only the field delimiter needs to be escaped. It
+  accomplishes this by quoting the entire field, then doubling any
+  quote characters which appear in the field.
It also quotes a field
+  if the first character is a quote character. It would seem we need
+  to support two modes: escape-by-quoting and escape-by-prefix. In
+  addition, for the second mode, we'd have to specify the escape
+  character (presumably defaulting to a backslash character).
+
+- Should there be a "fully quoted" mode for writing? What about
+  "fully quoted except for numeric values"?
+
+- What about end-of-line? If I generate a CSV file on a Unix system,
+  will Excel properly recognize the LF-only line terminators?
+
+- What about conversion to other file formats? Is the list-of-lists
+  output from the csvreader sufficient to feed into other writers?
+
+- What about an option to generate list-of-dict output from the reader
+  and accept list-of-dicts by the writer? This makes manipulating
+  individual rows easier since each one is independent, but you lose
+  field order when writing and have to tell the writer object the
+  order the fields should appear in the file.
+
+- Are quote character and delimiters limited to single characters? I
+  had a client not that long ago who wrote their own flat file format
+  with a delimiter of ":::".
+
+- How should rows of different lengths be handled? The options seem
+  to be:
+
+  * raise an exception when a row is encountered whose length differs
+    from the previous row
+
+  * silently return short rows
+
+  * allow the caller to specify the desired row length and what to do
+    when rows of a different length are encountered: ignore, truncate,
+    pad, raise exception, etc.
+
+
+References
+==========
+
+.. [1] csv module, Object Craft
+   (http://www.object-craft.com.au/projects/csv)
+
+.. [2] Python-DSV module, Wells
+   (http://sourceforge.net/projects/python-dsv/)
+
+.. [3] ASV module, Tratt
+   (http://tratt.net/laurie/python/asv/)
+
+There are many references to other CSV-related projects on the Web. A
+few are included here.
+
+
+Copyright
+=========
+
+This document has been placed in the public domain.
+
+
+
+..
+   Local Variables:
+   mode: indented-text
+   indent-tabs-mode: nil
+   sentence-end-double-space: t
+   fill-column: 70
+   End:
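
Looking back at the Optional Parameters section of the PEP text in this patch, the rule that the dialect is processed first and the remaining parameters then override it could be sketched roughly as follows. This is only an illustration under stated assumptions: the names ``DIALECTS`` and ``resolve_format`` are hypothetical and not part of the proposed API; only the "excel2000" dialect name, its case-insensitive lookup, and the three setting names with their defaults come from the PEP.

```python
# Hypothetical table of dialect defaults. The PEP names only "excel2000"
# and gives these three defaults for it.
DIALECTS = {
    "excel2000": {
        "quotechar": '"',
        "delimiter": ",",
        "skipinitialspace": False,
    },
}


def resolve_format(dialect="excel2000", **overrides):
    """Return effective settings: dialect defaults first, then overrides."""
    try:
        # The PEP says the dialect name is matched case-insensitively.
        settings = dict(DIALECTS[dialect.lower()])
    except KeyError:
        raise ValueError("unknown dialect: %r" % (dialect,))
    for name, value in overrides.items():
        if name not in settings:
            raise TypeError("unexpected parameter: %r" % (name,))
        settings[name] = value
    return settings


# Mirrors the PEP's example: Excel 2000 output with single quotes and TABs.
fmt = resolve_format(dialect="Excel2000", quotechar="'", delimiter="\t")
print(fmt)
```

Copying the dialect's defaults before applying overrides keeps the shared table untouched, so one caller's overrides cannot leak into another reader or writer.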