various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section
parent 1ff1f735ff
commit 1cf7aea11c
115
pep-0305.txt
|
@@ -11,7 +11,7 @@ Status: Draft
|
||||||
Type: Standards Track
|
Type: Standards Track
|
||||||
Content-Type: text/x-rst
|
Content-Type: text/x-rst
|
||||||
Created: 26-Jan-2003
|
Created: 26-Jan-2003
|
||||||
Post-History:
|
Post-History: 31-Jan-2003
|
||||||
|
|
||||||
|
|
||||||
Abstract
|
Abstract
|
||||||
|
@@ -24,7 +24,8 @@ stable specification and is subtle enough that parsing lines of a CSV
|
||||||
file with something like ``line.split(",")`` is bound to fail. This
|
file with something like ``line.split(",")`` is bound to fail. This
|
||||||
PEP defines an API for reading and writing CSV files which should make
|
PEP defines an API for reading and writing CSV files which should make
|
||||||
it possible for programmers to select a CSV module which meets their
|
it possible for programmers to select a CSV module which meets their
|
||||||
requirements.
|
requirements. It is accompanied by a corresponding module which
|
||||||
|
implements the API.
|
||||||
|
|
||||||
|
|
||||||
To Do (Notes for the Interested and Ambitious)
|
To Do (Notes for the Interested and Ambitious)
|
||||||
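Note on the hunk above: the Abstract's claim that ``line.split(",")`` is bound to fail can be illustrated with a short sketch. It assumes the ``csv`` module as it ships in today's Python standard library, which is not the draft API shown in this diff; the sample line is invented for illustration.

```python
import csv
import io

# Invented sample record: the second field contains an embedded comma.
line = 'Smith,"Springfield, IL",42\r\n'

# Naive splitting cuts the quoted field in two and keeps the quote marks.
naive = line.rstrip("\r\n").split(",")

# A CSV-aware reader honours the quoting and returns three fields.
parsed = next(csv.reader(io.StringIO(line, newline="")))

print(naive)
print(parsed)
```

The naive split yields four pieces instead of three, which is the kind of silent corruption the PEP is trying to prevent.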
|
@@ -46,11 +47,11 @@ Existing Modules
|
||||||
Three widely available modules enable programmers to read and write
|
Three widely available modules enable programmers to read and write
|
||||||
CSV files:
|
CSV files:
|
||||||
|
|
||||||
- Object Craft's CSV module [1]_
|
- Object Craft's CSV module [2]_
|
||||||
|
|
||||||
- Cliff Wells's Python-DSV module [2]_
|
- Cliff Wells' Python-DSV module [3]_
|
||||||
|
|
||||||
- Laurence Tratt's ASV module [3]_
|
- Laurence Tratt's ASV module [4]_
|
||||||
|
|
||||||
Each has a different API, making it somewhat difficult for programmers
|
Each has a different API, making it somewhat difficult for programmers
|
||||||
to switch between them. More of a problem may be that they interpret
|
to switch between them. More of a problem may be that they interpret
|
||||||
|
@@ -69,6 +70,21 @@ change. This PEP also forms a set of requirements for creation of a
|
||||||
module which will hopefully be incorporated into the Python
|
module which will hopefully be incorporated into the Python
|
||||||
distribution.
|
distribution.
|
||||||
|
|
||||||
|
CSV formats are not well-defined and different implementations have a
|
||||||
|
number of subtle corner cases. It has been suggested that the "V" in
|
||||||
|
the acronym stands for "Vague" instead of "Values". Different
|
||||||
|
delimiters and quoting characters are just the start. Some programs
|
||||||
|
generate whitespace after the delimiter. Others quote embedded
|
||||||
|
quoting characters by doubling them or prefixing them with an escape
|
||||||
|
character. The list of weird ways to do things seems nearly endless.
|
||||||
|
|
||||||
|
Unfortunately, all this variability and subtlety means it is difficult
|
||||||
|
for programmers to reliably parse CSV files from many sources or
|
||||||
|
generate CSV files designed to be fed to specific external programs
|
||||||
|
without deep knowledge of those sources and programs. This PEP and
|
||||||
|
the software which accompany it attempt to make the process less
|
||||||
|
fragile.
|
||||||
|
|
||||||
|
|
||||||
Module Interface
|
Module Interface
|
||||||
================
|
================
|
||||||
|
@@ -76,7 +92,8 @@ Module Interface
|
||||||
The module supports two basic APIs, one for reading and one for
|
The module supports two basic APIs, one for reading and one for
|
||||||
writing. The basic reading interface is::
|
writing. The basic reading interface is::
|
||||||
|
|
||||||
reader(fileobj [, dialect='excel2000'] [optional keyword args])
|
obj = reader(fileobj [, dialect='excel2000']
|
||||||
|
[optional keyword args])
|
||||||
|
|
||||||
A reader object is an iterable which takes a file-like object opened
|
A reader object is an iterable which takes a file-like object opened
|
||||||
for reading as the sole required parameter. The optional dialect
|
for reading as the sole required parameter. The optional dialect
|
||||||
|
@@ -91,13 +108,13 @@ as follows::
|
||||||
|
|
||||||
The writing interface is similar::
|
The writing interface is similar::
|
||||||
|
|
||||||
writer(fileobj [, dialect='excel2000'], [, fieldnames=list]
|
obj = writer(fileobj [, dialect='excel2000'], [, fieldnames=seq]
|
||||||
[optional keyword args])
|
[optional keyword args])
|
||||||
|
|
||||||
A writer object is a wrapper around a file-like object opened for
|
A writer object is a wrapper around a file-like object opened for
|
||||||
writing. It accepts the same optional keyword parameters as the
|
writing. It accepts the same optional keyword parameters as the
|
||||||
reader constructor. In addition, it accepts an optional fieldnames
|
reader constructor. In addition, it accepts an optional fieldnames
|
||||||
argument. This is a list which defines the order of fields in the
|
argument. This is a sequence that defines the order of fields in the
|
||||||
output file. It allows the write() method to accept mapping objects
|
output file. It allows the write() method to accept mapping objects
|
||||||
as well as sequence objects.
|
as well as sequence objects.
|
||||||
|
|
||||||
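Note on the hunk above: a rough sketch of the reader/writer pairing, under the assumption that the ``csv`` module in today's standard library is used rather than the draft API in this diff. In the shipped module the draft's ``fieldnames`` argument lives on ``csv.DictWriter``, and rows are written with ``writerow()`` rather than ``write()``. The field names and values below are made up.

```python
import csv
import io

# In-memory file object; newline="" leaves line endings to the csv module.
buf = io.StringIO(newline="")

# fieldnames fixes the column order, so mapping objects can be written.
writer = csv.DictWriter(buf, fieldnames=["name", "qty"], dialect="excel")
writer.writeheader()                           # header row is written explicitly
writer.writerow({"qty": 3, "name": "widget"})  # key order in the dict is irrelevant

# Read the same data back; the reader is an iterable of rows (lists of strings).
buf.seek(0)
rows = list(csv.reader(buf, dialect="excel"))
print(rows)
```

Every value comes back as a string; converting ``"3"`` to an integer is left to the caller.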
|
@@ -115,6 +132,8 @@ programmer must explicitly write it, e.g.::
|
||||||
for row in someiterable:
|
for row in someiterable:
|
||||||
csvwriter.write(row)
|
csvwriter.write(row)
|
||||||
|
|
||||||
|
or arrange for it to be the first row in the iterable being written.
|
||||||
|
|
||||||
|
|
||||||
Dialects
|
Dialects
|
||||||
--------
|
--------
|
||||||
|
@@ -122,21 +141,21 @@ Dialects
|
||||||
Readers and writers support a dialect argument which is just a
|
Readers and writers support a dialect argument which is just a
|
||||||
convenient handle on a group of lower level parameters.
|
convenient handle on a group of lower level parameters.
|
||||||
|
|
||||||
When dialect is a string it identifies one of the dialect which is
|
When dialect is a string it identifies one of the dialects which is
|
||||||
known to the module, otherwise it is processed as a dialect class as
|
known to the module, otherwise it is processed as a dialect class as
|
||||||
described below.
|
described below.
|
||||||
|
|
||||||
Dialects will generally be named after applications or organizations
|
Dialects will generally be named after applications or organizations
|
||||||
which define specific sets of format constraints. The initial dialect
|
which define specific sets of format constraints. The initial dialect
|
||||||
is excel2000, which describes the format constraints of Excel 2000's
|
is "excel", which describes the format constraints of Excel 97 and
|
||||||
CSV format. Another possible dialect (used here only as an example)
|
Excel 2000 regarding CSV input and output. Another possible dialect
|
||||||
might be "gnumeric".
|
(used here only as an example) might be "gnumeric".
|
||||||
|
|
||||||
Dialects are implemented as attribute only classes to enable user to
|
Dialects are implemented as attribute only classes to enable users to
|
||||||
construct variant dialects by subclassing. The excel2000 dialect is
|
construct variant dialects by subclassing. The "excel" dialect is
|
||||||
implemented as follows::
|
implemented as follows::
|
||||||
|
|
||||||
class excel2000:
|
class excel:
|
||||||
quotechar = '"'
|
quotechar = '"'
|
||||||
delimiter = ','
|
delimiter = ','
|
||||||
escapechar = None
|
escapechar = None
|
||||||
|
@@ -150,13 +169,17 @@ follows::
|
||||||
class exceltsv(csv.excel2000):
|
class exceltsv(csv.excel2000):
|
||||||
delimiter = '\t'
|
delimiter = '\t'
|
||||||
|
|
||||||
Two functions are defined in the API to set and retrieve dialects::
|
Three functions are defined in the API to set, get and list dialects::
|
||||||
|
|
||||||
set_dialect(name, dialect)
|
set_dialect(name, dialect)
|
||||||
dialect = get_dialect(name)
|
dialect = get_dialect(name)
|
||||||
|
known_dialects = list_dialects()
|
||||||
|
|
||||||
The dialect parameter is a class or instance whose attributes are the
|
The dialect parameter is a class or instance whose attributes are the
|
||||||
formatting parameters defined in the next section.
|
formatting parameters defined in the next section. The
|
||||||
|
list_dialects() function returns all the registered dialect names as
|
||||||
|
given in previous set_dialect() calls (both predefined and
|
||||||
|
user-defined).
|
||||||
|
|
||||||
|
|
||||||
Formatting Parameters
|
Formatting Parameters
|
||||||
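Note on the hunk above: subclassing a dialect and using the dialect registry might look like the sketch below. It assumes the ``csv`` module in today's standard library, where registration is done with ``csv.register_dialect()`` instead of the draft's ``set_dialect()``. The class name ``ExcelTSV`` and the registry name ``"exceltsv-demo"`` are hypothetical.

```python
import csv


class ExcelTSV(csv.excel):
    """Excel conventions with a tab delimiter (hypothetical example class)."""
    delimiter = "\t"


# Register the variant under an illustrative name, then look it up again.
csv.register_dialect("exceltsv-demo", ExcelTSV)
dialect = csv.get_dialect("exceltsv-demo")

# list_dialects() reports predefined and user-registered names alike.
known = csv.list_dialects()
print(dialect.delimiter == "\t", "exceltsv-demo" in known)
```

Only the overridden attribute changes; everything else, such as ``quotechar``, is inherited from the base dialect.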
|
@@ -167,54 +190,65 @@ formatting parameters, specified as keyword parameters. The
|
||||||
parameters are also the keys for the input and output mapping objects
|
parameters are also the keys for the input and output mapping objects
|
||||||
for the set_dialect() and get_dialect() module functions.
|
for the set_dialect() and get_dialect() module functions.
|
||||||
|
|
||||||
- quotechar specifies a one-character string to use as the quoting
|
- ``quotechar`` specifies a one-character string to use as the quoting
|
||||||
character. It defaults to '"'.
|
character. It defaults to '"'.
|
||||||
|
|
||||||
- delimiter specifies a one-character string to use as the field
|
- ``delimiter`` specifies a one-character string to use as the field
|
||||||
separator. It defaults to ','.
|
separator. It defaults to ','.
|
||||||
|
|
||||||
- escapechar specifies a one character string used to escape the
|
- ``escapechar`` specifies a one character string used to escape the
|
||||||
delimiter when quotechar is set to None.
|
delimiter when quotechar is set to None.
|
||||||
|
|
||||||
- skipinitialspace specifies how to interpret whitespace which
|
- ``skipinitialspace`` specifies how to interpret whitespace which
|
||||||
immediately follows a delimiter. It defaults to False, which means
|
immediately follows a delimiter. It defaults to False, which means
|
||||||
that whitespace immediate following a delimiter is part of the
|
that whitespace immediately following a delimiter is part of the
|
||||||
following field.
|
following field.
|
||||||
|
|
||||||
- lineterminator specifies the character sequence which should
|
- ``lineterminator`` specifies the character sequence which should
|
||||||
terminate rows.
|
terminate rows.
|
||||||
|
|
||||||
- quoting controls when quotes should be generated by the
|
- ``quoting`` controls when quotes should be generated by the
|
||||||
writer.
|
writer. It can take on any of the following module constants::
|
||||||
|
|
||||||
"minimal" means only when required, for example, when a field
|
csv.QUOTE_MINIMAL means only when required, for example, when a
|
||||||
contains either the quotechar or the delimiter
|
field contains either the quotechar or the delimiter
|
||||||
|
|
||||||
"always" means that quotes are always placed around fields.
|
csv.QUOTE_ALL means that quotes are always placed around fields.
|
||||||
|
|
||||||
"nonnumeric" means that quotes are always placed around fields
|
csv.QUOTE_NONNUMERIC means that quotes are always placed around
|
||||||
which contain characters other than [+-0-9.].
|
fields which contain characters other than [+-0-9.].
|
||||||
|
|
||||||
... XXX More to come XXX ...
|
- ``doublequote`` (tbd)
|
||||||
|
|
||||||
|
- are there more to come?
|
||||||
|
|
||||||
When processing a dialect setting and one or more of the other
|
When processing a dialect setting and one or more of the other
|
||||||
optional parameters, the dialect parameter is processed first, then
|
optional parameters, the dialect parameter is processed first, then
|
||||||
the others are processed. This makes it easy to choose a dialect,
|
the others are processed. This makes it easy to choose a dialect,
|
||||||
then override one or more of the settings. For example, if a CSV file
|
then override one or more of the settings without defining a new
|
||||||
was generated by Excel 2000 using single quotes as the quote
|
dialect class. For example, if a CSV file was generated by Excel 2000
|
||||||
character and TAB as the delimiter, you could create a reader like::
|
using single quotes as the quote character and TAB as the delimiter,
|
||||||
|
you could create a reader like::
|
||||||
|
|
||||||
csvreader = csv.reader(file("some.csv"), dialect="excel2000",
|
csvreader = csv.reader(file("some.csv"), dialect="excel",
|
||||||
quotechar="'", delimiter='\t')
|
quotechar="'", delimiter='\t')
|
||||||
|
|
||||||
Other details of how Excel generates CSV files would be handled
|
Other details of how Excel generates CSV files would be handled
|
||||||
automatically.
|
automatically.
|
||||||
|
|
||||||
|
|
||||||
|
Implementation
|
||||||
|
==============
|
||||||
|
|
||||||
|
There is a sample implementation available. [1]_ The goal is for it
|
||||||
|
to efficiently implement the API described in the PEP. It is heavily
|
||||||
|
based on the Object Craft csv module. [2]_
|
||||||
|
|
||||||
|
|
||||||
Testing
|
Testing
|
||||||
=======
|
=======
|
||||||
|
|
||||||
TBD.
|
The sample implementation [1]_ includes a set of test cases.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
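Note on the hunk above: the quoting constants and the "pick a dialect, then override" pattern can be sketched as follows, assuming the ``csv`` module in today's standard library. One difference from the draft text: the shipped ``QUOTE_NONNUMERIC`` decides by the value's type (numbers stay unquoted), not by inspecting characters. The helper ``render`` and the sample row are hypothetical.

```python
import csv
import io


def render(row, **fmtparams):
    """Write one row with the 'excel' dialect plus any overrides (helper for this sketch)."""
    buf = io.StringIO(newline="")
    csv.writer(buf, dialect="excel", **fmtparams).writerow(row)
    return buf.getvalue()


row = ["widget", 3, 'say "hi"', "a,b"]

minimal = render(row, quoting=csv.QUOTE_MINIMAL)        # quote only when required
everything = render(row, quoting=csv.QUOTE_ALL)         # quote every field
nonnumeric = render(row, quoting=csv.QUOTE_NONNUMERIC)  # quote all non-number values

# The dialect is applied first, then the keyword overrides replace single settings.
overridden = render(row, delimiter="\t", quotechar="'")

for text in (minimal, everything, nonnumeric, overridden):
    print(repr(text))
```

Overriding individual parameters this way avoids defining a new dialect class for a one-off variation.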
|
@@ -283,13 +317,16 @@ Issues
|
||||||
References
|
References
|
||||||
==========
|
==========
|
||||||
|
|
||||||
.. [1] csv module, Object Craft
|
.. [1] csv module, Python Sandbox
|
||||||
|
(http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/csv/)
|
||||||
|
|
||||||
|
.. [2] csv module, Object Craft
|
||||||
(http://www.object-craft.com.au/projects/csv)
|
(http://www.object-craft.com.au/projects/csv)
|
||||||
|
|
||||||
.. [2] Python-DSV module, Wells
|
.. [3] Python-DSV module, Wells
|
||||||
(http://sourceforge.net/projects/python-dsv/)
|
(http://sourceforge.net/projects/python-dsv/)
|
||||||
|
|
||||||
.. [3] ASV module, Tratt
|
.. [4] ASV module, Tratt
|
||||||
(http://tratt.net/laurie/python/asv/)
|
(http://tratt.net/laurie/python/asv/)
|
||||||
|
|
||||||
There are many references to other CSV-related projects on the Web. A
|
There are many references to other CSV-related projects on the Web. A
|
||||||
|
|