various cleanups

expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section
parent 1ff1f735ff
commit 1cf7aea11c

115 pep-0305.txt

@@ -11,7 +11,7 @@ Status: Draft
 Type: Standards Track
 Content-Type: text/x-rst
 Created: 26-Jan-2003
-Post-History:
+Post-History: 31-Jan-2003
 
 
 Abstract
@@ -24,7 +24,8 @@ stable specification and is subtle enough that parsing lines of a CSV
 file with something like ``line.split(",")`` is bound to fail. This
 PEP defines an API for reading and writing CSV files which should make
 it possible for programmers to select a CSV module which meets their
-requirements.
+requirements. It is accompanied by a corresponding module which
+implements the API.
 
 
 To Do (Notes for the Interested and Ambitious)
@@ -46,11 +47,11 @@ Existing Modules
 Three widely available modules enable programmers to read and write
 CSV files:
 
-- Object Craft's CSV module [1]_
+- Object Craft's CSV module [2]_
 
-- Cliff Wells's Python-DSV module [2]_
+- Cliff Wells' Python-DSV module [3]_
 
-- Laurence Tratt's ASV module [3]_
+- Laurence Tratt's ASV module [4]_
 
 Each has a different API, making it somewhat difficult for programmers
 to switch between them. More of a problem may be that they interpret
@@ -69,6 +70,21 @@ change. This PEP also forms a set of requirements for creation of a
 module which will hopefully be incorporated into the Python
 distribution.
 
+CSV formats are not well-defined and different implementations have a
+number of subtle corner cases. It has been suggested that the "V" in
+the acronym stands for "Vague" instead of "Values". Different
+delimiters and quoting characters are just the start. Some programs
+generate whitespace after the delimiter. Others quote embedded
+quoting characters by doubling them or prefixing them with an escape
+character. The list of weird ways to do things seems nearly endless.
+
+Unfortunately, all this variability and subtlety means it is difficult
+for programmers to reliably parse CSV files from many sources or
+generate CSV files designed to be fed to specific external programs
+without deep knowledge of those sources and programs. This PEP and
+the software which accompany it attempt to make the process less
+fragile.
+
 
 Module Interface
 ================
@@ -76,7 +92,8 @@ Module Interface
 The module supports two basic APIs, one for reading and one for
 writing. The basic reading interface is::
 
-    reader(fileobj [, dialect='excel2000'] [optional keyword args])
+    obj = reader(fileobj [, dialect='excel2000']
+                 [optional keyword args])
 
 A reader object is an iterable which takes a file-like object opened
 for reading as the sole required parameter. The optional dialect
@@ -91,13 +108,13 @@ as follows::
 
 The writing interface is similar::
 
-    writer(fileobj [, dialect='excel2000'], [, fieldnames=list]
+    obj = writer(fileobj [, dialect='excel2000'], [, fieldnames=seq]
                  [optional keyword args])
 
 A writer object is a wrapper around a file-like object opened for
 writing. It accepts the same optional keyword parameters as the
 reader constructor. In addition, it accepts an optional fieldnames
-argument. This is a list which defines the order of fields in the
+argument. This is a sequence that defines the order of fields in the
 output file. It allows the write() method to accept mapping objects
 as well as sequence objects.
 
@@ -115,6 +132,8 @@ programmer must explicitly write it, e.g.::
     for row in someiterable:
         csvwriter.write(row)
 
+or arrange for it to be the first row in the iterable being written.
+
 
 Dialects
 --------
@@ -122,21 +141,21 @@ Dialects
 Readers and writers support a dialect argument which is just a
 convenient handle on a group of lower level parameters.
 
-When dialect is a string it identifies one of the dialect which is
+When dialect is a string it identifies one of the dialects which is
 known to the module, otherwise it is processed as a dialect class as
 described below.
 
 Dialects will generally be named after applications or organizations
 which define specific sets of format constraints. The initial dialect
-is excel2000, which describes the format constraints of Excel 2000's
-CSV format. Another possible dialect (used here only as an example)
-might be "gnumeric".
+is "excel", which describes the format constraints of Excel 97 and
+Excel 2000 regarding CSV input and output. Another possible dialect
+(used here only as an example) might be "gnumeric".
 
-Dialects are implemented as attribute only classes to enable user to
-construct variant dialects by subclassing. The excel2000 dialect is
+Dialects are implemented as attribute only classes to enable users to
+construct variant dialects by subclassing. The "excel" dialect is
 implemented as follows::
 
-    class excel2000:
+    class excel:
         quotechar = '"'
         delimiter = ','
         escapechar = None
@@ -150,13 +169,17 @@ follows::
     class exceltsv(csv.excel2000):
         delimiter = '\t'
 
-Two functions are defined in the API to set and retrieve dialects::
+Three functions are defined in the API to set, get and list dialects::
 
     set_dialect(name, dialect)
     dialect = get_dialect(name)
+    known_dialects = list_dialects()
 
 The dialect parameter is a class or instance whose attributes are the
-formatting parameters defined in the next section.
+formatting parameters defined in the next section. The
+list_dialects() function returns all the registered dialect names as
+given in previous set_dialect() calls (both predefined and
+user-defined).
 
 
 Formatting Parameters
@@ -167,54 +190,65 @@ formatting parameters, specified as keyword parameters. The
 parameters are also the keys for the input and output mapping objects
 for the set_dialect() and get_dialect() module functions.
 
-- quotechar specifies a one-character string to use as the quoting
+- ``quotechar`` specifies a one-character string to use as the quoting
   character. It defaults to '"'.
 
-- delimiter specifies a one-character string to use as the field
+- ``delimiter`` specifies a one-character string to use as the field
   separator. It defaults to ','.
 
-- escapechar specifies a one character string used to escape the
+- ``escapechar`` specifies a one character string used to escape the
   delimiter when quotechar is set to None.
 
-- skipinitialspace specifies how to interpret whitespace which
+- ``skipinitialspace`` specifies how to interpret whitespace which
   immediately follows a delimiter. It defaults to False, which means
-  that whitespace immediate following a delimiter is part of the
+  that whitespace immediately following a delimiter is part of the
   following field.
 
-- lineterminator specifies the character sequence which should
+- ``lineterminator`` specifies the character sequence which should
   terminate rows.
 
-- quoting controls when quotes should be generated by the
-  writer.
+- ``quoting`` controls when quotes should be generated by the
+  writer. It can take on any of the following module constants::
 
-    "minimal" means only when required, for example, when a field
-    contains either the quotechar or the delimiter
+    csv.QUOTE_MINIMAL means only when required, for example, when a
+    field contains either the quotechar or the delimiter
 
-    "always" means that quotes are always placed around fields.
+    csv.QUOTE_ALL means that quotes are always placed around fields.
 
-    "nonnumeric" means that quotes are always placed around fields
-    which contain characters other than [+-0-9.].
+    csv.QUOTE_NONNUMERIC means that quotes are always placed around
+    fields which contain characters other than [+-0-9.].
 
-... XXX More to come XXX ...
+- ``doublequote`` (tbd)
+
+- are there more to come?
 
 When processing a dialect setting and one or more of the other
 optional parameters, the dialect parameter is processed first, then
 the others are processed. This makes it easy to choose a dialect,
-then override one or more of the settings. For example, if a CSV file
-was generated by Excel 2000 using single quotes as the quote
-character and TAB as the delimiter, you could create a reader like::
+then override one or more of the settings without defining a new
+dialect class. For example, if a CSV file was generated by Excel 2000
+using single quotes as the quote character and TAB as the delimiter,
+you could create a reader like::
 
-    csvreader = csv.reader(file("some.csv"), dialect="excel2000",
+    csvreader = csv.reader(file("some.csv"), dialect="excel",
                            quotechar="'", delimiter='\t')
 
 Other details of how Excel generates CSV files would be handled
 automatically.
 
 
+Implementation
+==============
+
+There is a sample implementation available. [1]_ The goal is for it
+to efficiently implement the API described in the PEP. It is heavily
+based on the Object Craft csv module. [2]_
+
+
 Testing
 =======
 
-TBD.
+The sample implementation [1]_ includes a set of test cases.
 
 
 
@@ -283,13 +317,16 @@ Issues
 References
 ==========
 
-.. [1] csv module, Object Craft
+.. [1] csv module, Python Sandbox
+   (http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/csv/)
+
+.. [2] csv module, Object Craft
    (http://www.object-craft.com.au/projects/csv)
 
-.. [2] Python-DSV module, Wells
+.. [3] Python-DSV module, Wells
    (http://sourceforge.net/projects/python-dsv/)
 
-.. [3] ASV module, Tratt
+.. [4] ASV module, Tratt
    (http://tratt.net/laurie/python/asv/)
 
 There are many references to other CSV-related projects on the Web. A
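
For readers who want to try the dialect mechanism this revision describes, the following is a minimal sketch against the csv module as it ships in today's Python standard library, not against the sandbox code in this commit. It assumes the shipped names, which differ from the draft: ``csv.register_dialect()`` stands in for the draft's ``set_dialect()``, and rows are written with ``writerow()`` rather than ``write()``. The ``ExcelTSV`` class and the "exceltsv" name are illustrative only.

```python
import csv
import io

# Illustrative variant dialect, built by subclassing the predefined
# "excel" dialect as the Dialects section describes.
class ExcelTSV(csv.excel):
    delimiter = "\t"

# Assumption: the shipped module spells the draft's set_dialect()
# as register_dialect().
csv.register_dialect("exceltsv", ExcelTSV)

# Choose a dialect by name, then override one setting by keyword.
buffer = io.StringIO()
writer = csv.writer(buffer, dialect="exceltsv", quoting=csv.QUOTE_ALL)
writer.writerow(["name", "note"])
writer.writerow(["Ada", 'said "hi"'])

# Read the same text back with the registered dialect.
rows = list(csv.reader(io.StringIO(buffer.getvalue()), dialect="exceltsv"))

for row in rows:
    print(row)
```

Passing ``quoting`` as a keyword alongside ``dialect`` follows the override order given in the Formatting Parameters hunk: the dialect is applied first, then the individual settings.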