various cleanups

- expanded Rationale a tad
- added Post-History date (announcing it in a moment)
- added pointer to sandbox implementation
- mentioned implementation in the (massive ;-) Testing section
2003-01-31 21:49:32 +00:00
parent 1ff1f735ff
commit 1cf7aea11c
1 changed file with 80 additions and 43 deletions
--- a/pep-0305.txt
+++ b/pep-0305.txt
@@ -11,7 +11,7 @@ Status: Draft
 Type: Standards Track
 Content-Type: text/x-rst
 Created: 26-Jan-2003
-Post-History: 
+Post-History: 31-Jan-2003


 Abstract
@@ -24,7 +24,8 @@ stable specification and is subtle enough that parsing lines of a CSV
 file with something like ``line.split(",")`` is bound to fail.  This
 PEP defines an API for reading and writing CSV files which should make
 it possible for programmers to select a CSV module which meets their
-requirements.
+requirements.  It is accompanied by a corresponding module which
+implements the API.


 To Do (Notes for the Interested and Ambitious)
@@ -46,11 +47,11 @@ Existing Modules
 Three widely available modules enable programmers to read and write
 CSV files:

- Object Craft's CSV module [1]_
+- Object Craft's CSV module [2]_

- Cliff Wells's Python-DSV module [2]_
+- Cliff Wells' Python-DSV module [3]_

- Laurence Tratt's ASV module [3]_
+- Laurence Tratt's ASV module [4]_

 Each has a different API, making it somewhat difficult for programmers
 to switch between them.  More of a problem may be that they interpret
@@ -69,6 +70,21 @@ change.  This PEP also forms a set of requirements for creation of a
 module which will hopefully be incorporated into the Python
 distribution.

+CSV formats are not well-defined and different implementations have a
+number of subtle corner cases.  It has been suggested that the "V" in
+the acronym stands for "Vague" instead of "Values".  Different
+delimiters and quoting characters are just the start.  Some programs
+generate whitespace after the delimiter.  Others quote embedded
+quoting characters by doubling them or prefixing them with an escape
+character.  The list of weird ways to do things seems nearly endless.
+
+Unfortunately, all this variability and subtlety means it is difficult
+for programmers to reliably parse CSV files from many sources or
+generate CSV files designed to be fed to specific external programs
+without deep knowledge of those sources and programs.  This PEP and
+the software which accompanies it attempt to make the process less
+fragile.
+

 Module Interface
 ================
@@ -76,7 +92,8 @@ Module Interface
 The module supports two basic APIs, one for reading and one for
 writing.  The basic reading interface is::

-    reader(fileobj [, dialect='excel2000'] [optional keyword args])
+    obj = reader(fileobj [, dialect='excel2000']
+                 [optional keyword args])

 A reader object is an iterable which takes a file-like object opened
 for reading as the sole required parameter.  The optional dialect
@@ -91,13 +108,13 @@ as follows::

 The writing interface is similar::

-    writer(fileobj [, dialect='excel2000'], [, fieldnames=list]
+    obj = writer(fileobj [, dialect='excel2000'], [, fieldnames=seq]
                 [optional keyword args])

 A writer object is a wrapper around a file-like object opened for
 writing.  It accepts the same optional keyword parameters as the
 reader constructor.  In addition, it accepts an optional fieldnames
-argument.  This is a list which defines the order of fields in the
+argument.  This is a sequence that defines the order of fields in the
 output file.  It allows the write() method to accept mapping objects
 as well as sequence objects.

@@ -115,6 +132,8 @@ programmer must explicitly write it, e.g.::
    for row in someiterable:
        csvwriter.write(row)

+or arrange for it to be the first row in the iterable being written.
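
As a rough illustration of the read/write round trip described above, the
sketch below uses the ``csv`` module as it exists in today's standard
library, not the draft API in this diff: ``writerow()`` stands in for the
draft ``write()`` method, and the sample rows are invented for the example.
It also shows why a naive ``line.split(",")`` goes wrong on quoted fields.

```python
import csv
import io

# Sample data (invented for this sketch). The header is just the first row.
rows = [
    ["name", "comment"],
    ["spam", 'says "hi", then leaves'],  # embedded quotes and a delimiter
    ["eggs", "plain"],
]

# Write to an in-memory file; newline="" leaves line endings to the csv module.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer, dialect="excel")
for row in rows:
    writer.writerow(row)

# Naive splitting breaks the quoted field into pieces.
first_data_line = buffer.getvalue().splitlines()[1]
naive_fields = first_data_line.split(",")

# Reading the same text back through csv.reader recovers the original rows.
buffer.seek(0)
parsed_rows = list(csv.reader(buffer, dialect="excel"))
```

Here ``naive_fields`` has three entries for a two-field row, while
``parsed_rows`` matches ``rows`` exactly.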
+

 Dialects
 --------
@@ -122,21 +141,21 @@ Dialects
 Readers and writers support a dialect argument which is just a
 convenient handle on a group of lower level parameters.

-When dialect is a string it identifies one of the dialect which is
+When dialect is a string it identifies one of the dialects which are
 known to the module, otherwise it is processed as a dialect class as
 described below.

 Dialects will generally be named after applications or organizations
 which define specific sets of format constraints.  The initial dialect
-is excel2000, which describes the format constraints of Excel 2000's
-CSV format.  Another possible dialect (used here only as an example)
-might be "gnumeric".
+is "excel", which describes the format constraints of Excel 97 and
+Excel 2000 regarding CSV input and output.  Another possible dialect
+(used here only as an example) might be "gnumeric".

-Dialects are implemented as attribute only classes to enable user to
-construct variant dialects by subclassing.  The excel2000 dialect is
+Dialects are implemented as attribute only classes to enable users to
+construct variant dialects by subclassing.  The "excel" dialect is
 implemented as follows::

-    class excel2000:
+    class excel:
        quotechar = '"'
        delimiter = ','
        escapechar = None
@@ -150,13 +169,17 @@ follows::
    class exceltsv(csv.excel2000):
        delimiter = '\t'

-Two functions are defined in the API to set and retrieve dialects::
+Three functions are defined in the API to set, get and list dialects::

    set_dialect(name, dialect)
    dialect = get_dialect(name)
+    known_dialects = list_dialects()

 The dialect parameter is a class or instance whose attributes are the
-formatting parameters defined in the next section.
+formatting parameters defined in the next section.  The
+list_dialects() function returns all the registered dialect names as
+given in previous set_dialect() calls (both predefined and
+user-defined).
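
The set/get/list trio can be tried out with the standard library ``csv``
module, with one caveat: the released module spells the setter
``register_dialect()`` rather than the draft ``set_dialect()`` shown in this
diff. The class name ``ExcelTSV`` and the registry name ``exceltsv-example``
are made up for this sketch.

```python
import csv

# A variant dialect built by subclassing, overriding only the delimiter.
class ExcelTSV(csv.excel):
    """Excel-style CSV that uses TAB as the field separator."""
    delimiter = "\t"

# Register the dialect under a (hypothetical) name, then look it up again.
csv.register_dialect("exceltsv-example", ExcelTSV)
registered = csv.get_dialect("exceltsv-example")

# The listing contains both predefined and user-registered names.
known_dialects = csv.list_dialects()
```

Attributes that the subclass does not override, such as ``quotechar``, are
inherited from the ``excel`` base dialect.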


 Formatting Parameters
@@ -167,54 +190,65 @@ formatting parameters, specified as keyword parameters.  The
 parameters are also the keys for the input and output mapping objects
 for the set_dialect() and get_dialect() module functions.

- quotechar specifies a one-character string to use as the quoting
+- ``quotechar`` specifies a one-character string to use as the quoting
  character.  It defaults to '"'.

- delimiter specifies a one-character string to use as the field
+- ``delimiter`` specifies a one-character string to use as the field
  separator.  It defaults to ','.

- escapechar specifies a one character string used to escape the
+- ``escapechar`` specifies a one character string used to escape the
  delimiter when quotechar is set to None.

- skipinitialspace specifies how to interpret whitespace which
+- ``skipinitialspace`` specifies how to interpret whitespace which
  immediately follows a delimiter.  It defaults to False, which means
-  that whitespace immediate following a delimiter is part of the
+  that whitespace immediately following a delimiter is part of the
  following field.

- lineterminator specifies the character sequence which should
+- ``lineterminator`` specifies the character sequence which should
  terminate rows.

- quoting controls when quotes should be generated by the
-  writer.
+- ``quoting`` controls when quotes should be generated by the
+  writer.  It can take on any of the following module constants::

-    "minimal" means only when required, for example, when a field
-    contains either the quotechar or the delimiter
+    csv.QUOTE_MINIMAL means only when required, for example, when a
+    field contains either the quotechar or the delimiter

-    "always" means that quotes are always placed around fields.
+    csv.QUOTE_ALL means that quotes are always placed around fields.

-    "nonnumeric" means that quotes are always placed around fields
-    which contain characters other than [+-0-9.].
+    csv.QUOTE_NONNUMERIC means that quotes are always placed around
+    fields which contain characters other than [+-0-9.].

-... XXX More to come XXX ...
+- ``doublequote`` (tbd)
+
+- are there more to come?
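
To make the three quoting constants concrete, here is a small sketch against
the standard library ``csv`` module. One assumption to flag: in the released
module ``QUOTE_NONNUMERIC`` decides by the Python type of each field (numbers
are left bare), which may differ from the character-based rule in the draft
text above. The helper name ``render`` is invented for the example.

```python
import csv
import io

def render(row, quoting):
    """Return the text a writer produces for one row under a quoting mode."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer, quoting=quoting).writerow(row)
    return buffer.getvalue()

row = ["spam", 42, "a,b"]

minimal = render(row, csv.QUOTE_MINIMAL)        # quote only where required
everything = render(row, csv.QUOTE_ALL)         # quote every field
nonnumeric = render(row, csv.QUOTE_NONNUMERIC)  # quote non-numeric fields
```

With this row, only ``"a,b"`` is quoted under ``QUOTE_MINIMAL``, all three
fields under ``QUOTE_ALL``, and the two string fields under
``QUOTE_NONNUMERIC``.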

 When processing a dialect setting and one or more of the other
 optional parameters, the dialect parameter is processed first, then
 the others are processed.  This makes it easy to choose a dialect,
-then override one or more of the settings.  For example, if a CSV file
-was generated by Excel 2000 using single quotes as the quote
-character and TAB as the delimiter, you could create a reader like::
+then override one or more of the settings without defining a new
+dialect class.  For example, if a CSV file was generated by Excel 2000
+using single quotes as the quote character and TAB as the delimiter,
+you could create a reader like::

-    csvreader = csv.reader(file("some.csv"), dialect="excel2000",
+    csvreader = csv.reader(file("some.csv"), dialect="excel",
                           quotechar="'", delimiter='\t')

 Other details of how Excel generates CSV files would be handled
 automatically.
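
The override behaviour can be checked with a short, self-contained sketch
using the standard library ``csv`` module. The input text is invented, and an
in-memory ``io.StringIO`` replaces the ``file("some.csv")`` call so the
example runs without a file on disk.

```python
import csv
import io

# Tab-delimited text that uses single quotes as the quote character.
# The doubled '' inside a quoted field stands for one literal quote.
data = io.StringIO("'spam\teggs'\t42\r\nplain\t'it''s'\r\n", newline="")

# Start from the excel dialect, then override two of its settings.
reader = csv.reader(data, dialect="excel", quotechar="'", delimiter="\t")
parsed = list(reader)
```

The tab inside the first quoted field is kept as data rather than treated as
a separator, and every other setting still comes from the ``excel`` dialect.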


+Implementation
+==============
+
+There is a sample implementation available.  [1]_ The goal is for it
+to efficiently implement the API described in the PEP.  It is heavily
+based on the Object Craft csv module. [2]_
+
+
 Testing
 =======

-TBD.
+The sample implementation [1]_ includes a set of test cases.



@@ -283,13 +317,16 @@ Issues
 References
 ==========

-.. [1] csv module, Object Craft
+.. [1] csv module, Python Sandbox
+   (http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/csv/)
+
+.. [2] csv module, Object Craft
   (http://www.object-craft.com.au/projects/csv)

-.. [2] Python-DSV module, Wells
+.. [3] Python-DSV module, Wells
   (http://sourceforge.net/projects/python-dsv/)

-.. [3] ASV module, Tratt
+.. [4] ASV module, Tratt
   (http://tratt.net/laurie/python/asv/)

 There are many references to other CSV-related projects on the Web.  A