various cleanups

expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section
Skip Montanaro 2003-01-31 21:49:32 +00:00
parent 1ff1f735ff
commit 1cf7aea11c
1 changed files with 80 additions and 43 deletions

@@ -11,7 +11,7 @@ Status: Draft
Type: Standards Track Type: Standards Track
Content-Type: text/x-rst Content-Type: text/x-rst
Created: 26-Jan-2003 Created: 26-Jan-2003
Post-History: Post-History: 31-Jan-2003
Abstract Abstract
@@ -24,7 +24,8 @@ stable specification and is subtle enough that parsing lines of a CSV
file with something like ``line.split(",")`` is bound to fail. This file with something like ``line.split(",")`` is bound to fail. This
PEP defines an API for reading and writing CSV files which should make PEP defines an API for reading and writing CSV files which should make
it possible for programmers to select a CSV module which meets their it possible for programmers to select a CSV module which meets their
requirements. requirements. It is accompanied by a corresponding module which
implements the API.
To Do (Notes for the Interested and Ambitious) To Do (Notes for the Interested and Ambitious)
@@ -46,11 +47,11 @@ Existing Modules
Three widely available modules enable programmers to read and write Three widely available modules enable programmers to read and write
CSV files: CSV files:
- Object Craft's CSV module [1]_ - Object Craft's CSV module [2]_
- Cliff Wells's Python-DSV module [2]_ - Cliff Wells' Python-DSV module [3]_
- Laurence Tratt's ASV module [3]_ - Laurence Tratt's ASV module [4]_
Each has a different API, making it somewhat difficult for programmers Each has a different API, making it somewhat difficult for programmers
to switch between them. More of a problem may be that they interpret to switch between them. More of a problem may be that they interpret
@@ -69,6 +70,21 @@ change. This PEP also forms a set of requirements for creation of a
module which will hopefully be incorporated into the Python module which will hopefully be incorporated into the Python
distribution. distribution.
CSV formats are not well-defined and different implementations have a
number of subtle corner cases. It has been suggested that the "V" in
the acronym stands for "Vague" instead of "Values". Different
delimiters and quoting characters are just the start. Some programs
generate whitespace after the delimiter. Others quote embedded
quoting characters by doubling them or prefixing them with an escape
character. The list of weird ways to do things seems nearly endless.
Unfortunately, all this variability and subtlety means it is difficult
for programmers to reliably parse CSV files from many sources or
generate CSV files designed to be fed to specific external programs
without deep knowledge of those sources and programs. This PEP and
the software which accompanies it attempt to make the process less
fragile.
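To make the fragility concrete, here is a minimal sketch comparing a naive ``line.split(",")`` with a quoting-aware parser. It assumes the ``csv`` module as it ships in today's Python standard library, whose API may differ from the draft interface in this PEP; the sample line is invented for illustration.

```python
import csv
import io

# Hypothetical sample row: the middle field contains the delimiter,
# so it is wrapped in quote characters.
line = 'Smith,"Portland, OR",42'

# Naive splitting ignores quoting and breaks the middle field in two.
naive = line.split(",")

# A CSV-aware reader honors the quotes and returns three fields.
parsed = next(csv.reader(io.StringIO(line)))

print(naive)
print(parsed)
```

The naive split produces four pieces with stray quote characters still attached, while the reader returns the three intended fields.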
Module Interface Module Interface
================ ================
@@ -76,7 +92,8 @@ Module Interface
The module supports two basic APIs, one for reading and one for The module supports two basic APIs, one for reading and one for
writing. The basic reading interface is:: writing. The basic reading interface is::
reader(fileobj [, dialect='excel2000'] [optional keyword args]) obj = reader(fileobj [, dialect='excel2000']
[optional keyword args])
A reader object is an iterable which takes a file-like object opened A reader object is an iterable which takes a file-like object opened
for reading as the sole required parameter. The optional dialect for reading as the sole required parameter. The optional dialect
@@ -91,13 +108,13 @@ as follows::
The writing interface is similar:: The writing interface is similar::
writer(fileobj [, dialect='excel2000'], [, fieldnames=list] obj = writer(fileobj [, dialect='excel2000'], [, fieldnames=seq]
[optional keyword args]) [optional keyword args])
A writer object is a wrapper around a file-like object opened for A writer object is a wrapper around a file-like object opened for
writing. It accepts the same optional keyword parameters as the writing. It accepts the same optional keyword parameters as the
reader constructor. In addition, it accepts an optional fieldnames reader constructor. In addition, it accepts an optional fieldnames
argument. This is a list which defines the order of fields in the argument. This is a sequence that defines the order of fields in the
output file. It allows the write() method to accept mapping objects output file. It allows the write() method to accept mapping objects
as well as sequence objects. as well as sequence objects.
@@ -115,6 +132,8 @@ programmer must explicitly write it, e.g.::
for row in someiterable: for row in someiterable:
csvwriter.write(row) csvwriter.write(row)
or arrange for it to be the first row in the iterable being written.
Dialects Dialects
-------- --------
@@ -122,21 +141,21 @@ Dialects
Readers and writers support a dialect argument which is just a Readers and writers support a dialect argument which is just a
convenient handle on a group of lower level parameters. convenient handle on a group of lower level parameters.
When dialect is a string it identifies one of the dialect which is When dialect is a string it identifies one of the dialects which is
known to the module, otherwise it is processed as a dialect class as known to the module, otherwise it is processed as a dialect class as
described below. described below.
Dialects will generally be named after applications or organizations Dialects will generally be named after applications or organizations
which define specific sets of format constraints. The initial dialect which define specific sets of format constraints. The initial dialect
is excel2000, which describes the format constraints of Excel 2000's is "excel", which describes the format constraints of Excel 97 and
CSV format. Another possible dialect (used here only as an example) Excel 2000 regarding CSV input and output. Another possible dialect
might be "gnumeric". (used here only as an example) might be "gnumeric".
Dialects are implemented as attribute only classes to enable user to Dialects are implemented as attribute only classes to enable users to
construct variant dialects by subclassing. The excel2000 dialect is construct variant dialects by subclassing. The "excel" dialect is
implemented as follows:: implemented as follows::
class excel2000: class excel:
quotechar = '"' quotechar = '"'
delimiter = ',' delimiter = ','
escapechar = None escapechar = None
@@ -150,13 +169,17 @@ follows::
class exceltsv(csv.excel2000): class exceltsv(csv.excel2000):
delimiter = '\t' delimiter = '\t'
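As a hedged illustration of variant dialects by subclassing, the sketch below uses the ``csv`` module from today's standard library, where the base dialect is spelled ``csv.excel``. The class name ``ExcelTSV`` and the sample row are hypothetical, and the writer method is ``writerow()`` in the shipped module rather than the ``write()`` named in this draft.

```python
import csv
import io


class ExcelTSV(csv.excel):
    """Excel conventions, but with TAB as the field separator."""
    delimiter = "\t"


# Write one row through the variant dialect into an in-memory buffer.
buf = io.StringIO()
csv.writer(buf, dialect=ExcelTSV).writerow(["a", "b c", "d"])
tsv_line = buf.getvalue()

print(repr(tsv_line))
```

Only ``delimiter`` is overridden; every other attribute (quote character, line terminator and so on) is inherited from the parent class.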
Two functions are defined in the API to set and retrieve dialects:: Three functions are defined in the API to set, get and list dialects::
set_dialect(name, dialect) set_dialect(name, dialect)
dialect = get_dialect(name) dialect = get_dialect(name)
known_dialects = list_dialects()
The dialect parameter is a class or instance whose attributes are the The dialect parameter is a class or instance whose attributes are the
formatting parameters defined in the next section. formatting parameters defined in the next section. The
list_dialects() function returns all the registered dialect names as
given in previous set_dialect() calls (both predefined and
user-defined).
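A minimal sketch of the set/get/list pattern follows. It assumes the ``csv`` module in today's standard library, where the registration function is named ``register_dialect()`` rather than the ``set_dialect()`` of this draft; the dialect name ``"pipes"`` and the class ``PipeSeparated`` are invented for the example.

```python
import csv


class PipeSeparated(csv.excel):
    """Hypothetical user-defined dialect: Excel rules with '|' as delimiter."""
    delimiter = "|"


# Register the dialect under a string name, then look it up again.
csv.register_dialect("pipes", PipeSeparated)
registered = csv.get_dialect("pipes")

# The listing includes both predefined and user-defined names.
names = csv.list_dialects()

print(registered.delimiter)
print(sorted(names))
```

Once registered, the string name can be passed wherever a dialect argument is accepted.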
Formatting Parameters Formatting Parameters
@@ -167,54 +190,65 @@ formatting parameters, specified as keyword parameters. The
parameters are also the keys for the input and output mapping objects parameters are also the keys for the input and output mapping objects
for the set_dialect() and get_dialect() module functions. for the set_dialect() and get_dialect() module functions.
- quotechar specifies a one-character string to use as the quoting - ``quotechar`` specifies a one-character string to use as the quoting
character. It defaults to '"'. character. It defaults to '"'.
- delimiter specifies a one-character string to use as the field - ``delimiter`` specifies a one-character string to use as the field
separator. It defaults to ','. separator. It defaults to ','.
- escapechar specifies a one character string used to escape the - ``escapechar`` specifies a one character string used to escape the
delimiter when quotechar is set to None. delimiter when quotechar is set to None.
- skipinitialspace specifies how to interpret whitespace which - ``skipinitialspace`` specifies how to interpret whitespace which
immediately follows a delimiter. It defaults to False, which means immediately follows a delimiter. It defaults to False, which means
that whitespace immediate following a delimiter is part of the that whitespace immediately following a delimiter is part of the
following field. following field.
- lineterminator specifies the character sequence which should - ``lineterminator`` specifies the character sequence which should
terminate rows. terminate rows.
- quoting controls when quotes should be generated by the - ``quoting`` controls when quotes should be generated by the
writer. writer. It can take on any of the following module constants::
"minimal" means only when required, for example, when a field csv.QUOTE_MINIMAL means only when required, for example, when a
contains either the quotechar or the delimiter field contains either the quotechar or the delimiter
"always" means that quotes are always placed around fields. csv.QUOTE_ALL means that quotes are always placed around fields.
"nonnumeric" means that quotes are always placed around fields csv.QUOTE_NONNUMERIC means that quotes are always placed around
which contain characters other than [+-0-9.]. fields which contain characters other than [+-0-9.].
... XXX More to come XXX ... - ``doublequote`` (tbd)
- are there more to come?
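The effect of the ``quoting`` constants can be sketched as below, assuming the ``csv`` module in today's standard library. One difference from the draft wording is worth flagging: the shipped writer decides what counts as numeric by the Python type of the value, not by scanning for characters in ``[+-0-9.]``. The helper ``render()`` and the sample row are hypothetical.

```python
import csv
import io


def render(row, quoting):
    """Write a single row with the given quoting mode and return the text."""
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting, lineterminator="\n").writerow(row)
    return buf.getvalue()


# Sample row: a plain string, a string containing the quotechar,
# a string containing the delimiter, and an integer.
row = ["spam", 'say "hi"', "a,b", 12]

minimal = render(row, csv.QUOTE_MINIMAL)        # quote only when required
everything = render(row, csv.QUOTE_ALL)         # quote every field
nonnumeric = render(row, csv.QUOTE_NONNUMERIC)  # quote all but numbers

print(minimal, everything, nonnumeric, sep="")
```

In all three modes the embedded quote character is escaped by doubling it, which is the behavior the ``doublequote`` parameter governs in the shipped module.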
When processing a dialect setting and one or more of the other When processing a dialect setting and one or more of the other
optional parameters, the dialect parameter is processed first, then optional parameters, the dialect parameter is processed first, then
the others are processed. This makes it easy to choose a dialect, the others are processed. This makes it easy to choose a dialect,
then override one or more of the settings. For example, if a CSV file then override one or more of the settings without defining a new
was generated by Excel 2000 using single quotes as the quote dialect class. For example, if a CSV file was generated by Excel 2000
character and TAB as the delimiter, you could create a reader like:: using single quotes as the quote character and TAB as the delimiter,
you could create a reader like::
csvreader = csv.reader(file("some.csv"), dialect="excel2000", csvreader = csv.reader(file("some.csv"), dialect="excel",
quotechar="'", delimiter='\t') quotechar="'", delimiter='\t')
Other details of how Excel generates CSV files would be handled Other details of how Excel generates CSV files would be handled
automatically. automatically.
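The override order described above can be tried with the following sketch. It assumes the ``csv`` module in today's standard library and substitutes an in-memory buffer for ``file("some.csv")`` so that it is self-contained; the sample data is invented.

```python
import csv
import io

# One row using single quotes as the quote character and TAB as the
# delimiter; the first field contains an embedded TAB inside quotes.
data = io.StringIO("'a\tb'\tc\r\n")

# Start from the "excel" dialect, then override two of its settings.
rows = list(csv.reader(data, dialect="excel",
                       quotechar="'", delimiter="\t"))

print(rows)
```

The keyword arguments take precedence over the dialect's own values, so no new dialect class is needed for a one-off variation.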
Implementation
==============
There is a sample implementation available. [1]_ The goal is for it
to efficiently implement the API described in the PEP. It is heavily
based on the Object Craft csv module. [2]_
Testing Testing
======= =======
TBD. The sample implementation [1]_ includes a set of test cases.
@@ -283,13 +317,16 @@ Issues
References References
========== ==========
.. [1] csv module, Object Craft .. [1] csv module, Python Sandbox
(http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/csv/)
.. [2] csv module, Object Craft
(http://www.object-craft.com.au/projects/csv) (http://www.object-craft.com.au/projects/csv)
.. [2] Python-DSV module, Wells .. [3] Python-DSV module, Wells
(http://sourceforge.net/projects/python-dsv/) (http://sourceforge.net/projects/python-dsv/)
.. [3] ASV module, Tratt .. [4] ASV module, Tratt
(http://tratt.net/laurie/python/asv/) (http://tratt.net/laurie/python/asv/)
There are many references to other CSV-related projects on the Web. A There are many references to other CSV-related projects on the Web. A