PEP: 305
Title: CSV File API
Version: $Revision$
Last-Modified: $Date$
Author: Kevin Altis, Dave Cole, Andrew McNamara, Skip Montanaro, Cliff Wells
Discussions-To: <csv@mail.mojam.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 26-Jan-2003
Post-History: 31-Jan-2003


Abstract
========

The Comma Separated Values (CSV) file format is the most common import
and export format for spreadsheets and databases. Although many CSV
files are simple to parse, the format is not formally defined by a
stable specification and is subtle enough that parsing lines of a CSV
file with something like ``line.split(",")`` is bound to fail. This
PEP defines an API for reading and writing CSV files which should make
it possible for programmers to select a CSV module which meets their
requirements. It is accompanied by a corresponding module which
implements the API.

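The failure mode of ``line.split(",")`` is easy to reproduce. The
sketch below is illustrative only: the sample line is made up, and it
uses the ``csv`` module as it later shipped in the Python standard
library rather than the draft API described in this PEP.

```python
import csv
import io

# One logical row with three fields; the middle field contains a comma
# and is therefore quoted, as spreadsheets typically export it.
line = 'Smith,"42 Main St, Apt 3",Springfield'

# Naive splitting cuts the quoted field in two and keeps the quote marks.
naive_fields = line.split(",")

# csv.reader accepts any iterable of lines and honours the quoting.
parsed_fields = next(csv.reader(io.StringIO(line)))

print(naive_fields)
print(parsed_fields)
```

The naive split reports four fields where the row has three, which is
the kind of silent corruption a purpose-built parser avoids.

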
To Do (Notes for the Interested and Ambitious)
==============================================

- Need to better explain the advantages of a purpose-built csv module
  over the simple ",".join() and [].split() approach.

- Need to complete initial list of formatting parameters and settle on
  names.

- Better motivation for the choice of passing a file object to the
  constructors. See http://manatee.mojam.com/pipermail/csv/2003-January/000179.html


Existing Modules
================

Three widely available modules enable programmers to read and write
CSV files:

- Object Craft's CSV module [2]_

- Cliff Wells' Python-DSV module [3]_

- Laurence Tratt's ASV module [4]_

Each has a different API, making it somewhat difficult for programmers
to switch between them. More of a problem may be that they interpret
some of the CSV corner cases differently, so even after surmounting
the differences in the module APIs, the programmer has to also deal
with semantic differences between the packages.


Rationale
=========

By defining common APIs for reading and writing CSV files, we make it
easier for programmers to choose an appropriate module to suit their
needs, and make it easier to switch between modules if their needs
change. This PEP also forms a set of requirements for creation of a
module which will hopefully be incorporated into the Python
distribution.

CSV formats are not well-defined and different implementations have a
number of subtle corner cases. It has been suggested that the "V" in
the acronym stands for "Vague" instead of "Values". Different
delimiters and quoting characters are just the start. Some programs
generate whitespace after the delimiter. Others quote embedded
quoting characters by doubling them or prefixing them with an escape
character. The list of weird ways to do things seems nearly endless.

Unfortunately, all this variability and subtlety means it is difficult
for programmers to reliably parse CSV files from many sources or
generate CSV files designed to be fed to specific external programs
without deep knowledge of those sources and programs. This PEP and
the software which accompanies it attempt to make the process less
fragile.


Module Interface
================

The module supports two basic APIs, one for reading and one for
writing. The basic reading interface is::

    obj = reader(iterable [, dialect='excel']
                 [optional keyword args])

A reader object is an iterable which takes an iterable object
returning lines as the sole required parameter. The optional dialect
parameter is discussed below. It also accepts several optional
keyword arguments which define specific format settings for the parser
(see the section "Formatting Parameters"). Readers are typically used
as follows::

    csvreader = csv.reader(file("some.csv"))
    for row in csvreader:
        process(row)

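Because the only requirement on the input is that it be an iterable
returning lines, a plain list can stand in for a file. The following
is a small sketch using the ``csv`` module as it later shipped in the
standard library; the sample data is invented for illustration.

```python
import csv

# Any iterable that yields lines will do; a list stands in for a file here.
lines = [
    "name,quantity\r\n",
    'widget,"1,000"\r\n',
]

# Each row comes back as a list of strings, one per field.
rows = [row for row in csv.reader(lines)]

for row in rows:
    print(row)
```

Every field is returned as a string; converting ``"1,000"`` to a number
is left to the caller.
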
The writing interface is similar::

    obj = writer(fileobj [, dialect='excel'] [, fieldnames=seq]
                 [optional keyword args])

A writer object is a wrapper around a file-like object opened for
writing. It accepts the same optional keyword parameters as the
reader constructor. In addition, it accepts an optional fieldnames
argument. This is a sequence that defines the order of fields in the
output file. It allows the write() method to accept mapping objects
as well as sequence objects.

Writers are typically used as follows::

    csvwriter = csv.writer(file("some.csv", "w"))
    for row in someiterable:
        csvwriter.write(row)

To generate a set of field names as the first row of the CSV file, the
programmer must explicitly write it, e.g.::

    csvwriter = csv.writer(file("some.csv", "w"), fieldnames=names)
    csvwriter.write(names)
    for row in someiterable:
        csvwriter.write(row)

or arrange for it to be the first row in the iterable being written.


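For comparison, the module that later shipped in the standard library
spells this differently from the draft above: rows are written with
``writerow()`` rather than ``write()``, and mapping support lives in a
separate ``DictWriter`` class that takes the ``fieldnames`` sequence.
A minimal sketch under that assumption, with made-up records:

```python
import csv
import io

fieldnames = ["name", "quantity"]
records = [
    {"quantity": 3, "name": "widget"},   # key order does not matter
    {"name": "gadget", "quantity": 5},
]

# An in-memory buffer stands in for a file opened for writing.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)

writer.writeheader()           # the header row is still written explicitly
for record in records:
    writer.writerow(record)    # fields are emitted in fieldnames order

output = buffer.getvalue()
print(output)
```

The ``fieldnames`` sequence fixes the column order, so mappings can be
written without the caller sorting their keys.

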
Dialects
--------

Readers and writers support a dialect argument which is just a
convenient handle on a group of lower level parameters.

When dialect is a string, it identifies one of the dialects known to
the module; otherwise it is processed as a dialect class as described
below.

Dialects will generally be named after applications or organizations
which define specific sets of format constraints. The initial dialect
is "excel", which describes the format constraints of Excel 97 and
Excel 2000 regarding CSV input and output. Another possible dialect
(used here only as an example) might be "gnumeric".

Dialects are implemented as attribute-only classes to enable users to
construct variant dialects by subclassing. The "excel" dialect is
implemented as follows::

    class excel:
        delimiter = ','
        quotechar = '"'
        escapechar = None
        doublequote = True
        skipinitialspace = False
        lineterminator = '\r\n'
        quoting = QUOTE_MINIMAL

An excel tab separated dialect can then be defined in user code as
follows::

    class exceltsv(csv.excel):
        delimiter = '\t'

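Subclassing in this way works unchanged in the ``csv`` module that
later shipped in the standard library, where ``csv.excel`` is the base
dialect class. The class name ``ExcelTSV`` and the sample row below are
illustrative only.

```python
import csv
import io

class ExcelTSV(csv.excel):
    """Excel conventions, but with TAB as the field separator."""
    delimiter = "\t"

# Write one row with the derived dialect.
buffer = io.StringIO()
csv.writer(buffer, dialect=ExcelTSV).writerow(["a b", "c,d", 'say "hi"'])
tsv_line = buffer.getvalue()

# Read it back with the same dialect to confirm the round trip.
round_trip = next(csv.reader([tsv_line], dialect=ExcelTSV))

print(repr(tsv_line))
print(round_trip)
```

Only the delimiter changes; quoting, quote doubling and the line
terminator are inherited from the parent dialect, so the comma in the
second field no longer needs quoting while the embedded quotes still do.
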
Three functions are defined in the API to set, get and list dialects::

    set_dialect(name, dialect)
    dialect = get_dialect(name)
    known_dialects = list_dialects()

The dialect parameter is a class or instance whose attributes are the
formatting parameters defined in the next section. The
list_dialects() function returns all the registered dialect names as
given in previous set_dialect() calls (both predefined and
user-defined).


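In the module that later shipped in the standard library, the
registration function is named ``register_dialect()`` rather than
``set_dialect()``, while ``get_dialect()`` and ``list_dialects()`` kept
the names proposed here. A short sketch under that assumption; the
dialect name ``"pipes"`` and the class ``PipeDialect`` are hypothetical.

```python
import csv

class PipeDialect(csv.excel):
    """Excel conventions with '|' as the delimiter (illustrative)."""
    delimiter = "|"

# Register the class under a name, then look it up again.
csv.register_dialect("pipes", PipeDialect)
known = csv.list_dialects()
pipes = csv.get_dialect("pipes")

# Once registered, the dialect can be selected by its string name.
rows = list(csv.reader(["a|b|c"], dialect="pipes"))

print(known)
print(pipes.delimiter)
print(rows)
```

Registering by name lets code that only knows the string, for example
from a configuration file, pick the dialect without importing the class.

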
Formatting Parameters
---------------------

Both the reader and writer constructors take several specific
formatting parameters, specified as keyword parameters. The
parameters are also the keys for the input and output mapping objects
for the set_dialect() and get_dialect() module functions.

- ``quotechar`` specifies a one-character string to use as the quoting
  character. It defaults to '"'. Setting this to None has the same
  effect as setting quoting to csv.QUOTE_NONE.

- ``delimiter`` specifies a one-character string to use as the field
  separator. It defaults to ','.

- ``escapechar`` specifies a one-character string used to escape the
  delimiter when quotechar is set to None.

- ``skipinitialspace`` specifies how to interpret whitespace which
  immediately follows a delimiter. It defaults to False, which means
  that whitespace immediately following a delimiter is part of the
  following field.

- ``lineterminator`` specifies the character sequence which should
  terminate rows.

- ``quoting`` controls when quotes should be generated by the
  writer. It can take on any of the following module constants::

    csv.QUOTE_MINIMAL means only when required, for example, when a
        field contains either the quotechar or the delimiter

    csv.QUOTE_ALL means that quotes are always placed around fields.

    csv.QUOTE_NONNUMERIC means that quotes are always placed around
        fields which contain characters other than [+-0-9.].

    csv.QUOTE_NONE means that quotes are never placed around
        fields.

- ``doublequote`` controls the handling of quotes inside fields. When
  True, two consecutive quotes are interpreted as one during read, and
  when writing, each quote is written as two quotes.

- are there more to come?

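The effect of the quoting constants can be seen by writing the same
row several times. This sketch uses the ``csv`` module as it later
shipped in the standard library; note that there ``QUOTE_NONNUMERIC``
decides by the Python type of each value rather than by the characters
it contains, which differs from the wording above. The helper
``render()`` and the sample row are illustrative only.

```python
import csv
import io

def render(row, **params):
    """Write one row with the given formatting parameters; return the text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n", **params).writerow(row)
    return buffer.getvalue()

row = ["plain", "has,comma", 'has "quote"', 12.5]

minimal = render(row, quoting=csv.QUOTE_MINIMAL)
everything = render(row, quoting=csv.QUOTE_ALL)
nonnumeric = render(row, quoting=csv.QUOTE_NONNUMERIC)

# With QUOTE_NONE an escapechar is needed to protect the delimiter.
escaped = render(["a,b", "c"], quoting=csv.QUOTE_NONE, escapechar="\\")

# doublequote on the reading side: two quotes collapse into one.
unquoted = next(csv.reader(['"say ""hi"""']))

# skipinitialspace drops whitespace that directly follows a delimiter.
trimmed = next(csv.reader(["a, b, c"], skipinitialspace=True))

for text in (minimal, everything, nonnumeric, escaped):
    print(text, end="")
```
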
When processing a dialect setting and one or more of the other
optional parameters, the dialect parameter is processed first, then
the others are processed. This makes it easy to choose a dialect,
then override one or more of the settings without defining a new
dialect class. For example, if a CSV file was generated by Excel 2000
using single quotes as the quote character and TAB as the delimiter,
you could create a reader like::

    csvreader = csv.reader(file("some.csv"), dialect="excel",
                           quotechar="'", delimiter='\t')

Other details of how Excel generates CSV files would be handled
automatically.


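The override order described above can be checked with an in-memory
line instead of a file. This is a sketch against the ``csv`` module as
it later shipped in the standard library; the sample line is invented.

```python
import csv

# A TAB-delimited line that uses single quotes as the quote character.
lines = ["'Smith, J.'\t42\t'it''s fine'\r\n"]

# Start from the "excel" dialect, then override two of its settings.
reader = csv.reader(lines, dialect="excel", quotechar="'", delimiter="\t")
row = next(reader)

print(row)
```

Quote doubling was not passed explicitly; it is inherited from the
"excel" dialect, which is why the doubled single quote in the last
field collapses to one.

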
Implementation
==============

There is a sample implementation available. [1]_ The goal is for it
to efficiently implement the API described in the PEP. It is heavily
based on the Object Craft csv module. [2]_


Testing
=======

The sample implementation [1]_ includes a set of test cases.


Issues
======

- Should a parameter control how consecutive delimiters are
  interpreted? Our thought is "no". Consecutive delimiters should
  always denote an empty field.

- What about Unicode? Is it sufficient to pass a file object gotten
  from codecs.open()? For example::

    csvreader = csv.reader(codecs.open("some.csv", "r", "cp1252"))

    csvwriter = csv.writer(codecs.open("some.csv", "w", "utf-8"))

  In the first example, text would be assumed to be encoded as cp1252.
  Should the system be aggressive in converting to Unicode or should
  Unicode strings only be returned if necessary?

  In the second example, the file will take care of automatically
  encoding Unicode strings as utf-8 before writing to disk.

- What about alternate escape conventions? When Excel exports a file,
  it appears only the field delimiter needs to be escaped. It
  accomplishes this by quoting the entire field, then doubling any
  quote characters which appear in the field. It also quotes a field
  if the first character is a quote character. It would seem we need
  to support two modes: escape-by-quoting and escape-by-prefix. In
  addition, for the second mode, we'd have to specify the escape
  character (presumably defaulting to a backslash character).

- Should there be a "fully quoted" mode for writing? What about
  "fully quoted except for numeric values"?

- What about end-of-line? If I generate a CSV file on a Unix system,
  will Excel properly recognize the LF-only line terminators?

- What about conversion to other file formats? Is the list-of-lists
  output from the csvreader sufficient to feed into other writers?

- What about an option to generate list-of-dict output from the reader
  and accept list-of-dicts by the writer? This makes manipulating
  individual rows easier since each one is independent, but you lose
  field order when writing and have to tell the writer object the
  order the fields should appear in the file.

- Are quote character and delimiters limited to single characters? I
  had a client not that long ago who wrote their own flat file format
  with a delimiter of ":::".

- How should rows of different lengths be handled? The options seem
  to be:

  * raise an exception when a row is encountered whose length differs
    from the previous row

  * silently return short rows

  * allow the caller to specify the desired row length and what to do
    when rows of a different length are encountered: ignore, truncate,
    pad, raise exception, etc.

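The last option in the row-length question can also be layered on top
of a reader that silently returns short rows. The sketch below is one
possible shape for such a wrapper, not part of the proposed API: the
function ``normalize_rows()`` and its ``on_mismatch`` policy names are
hypothetical, and it relies on the ``csv`` module as it later shipped
in the standard library.

```python
import csv

def normalize_rows(rows, width, on_mismatch="resize", fill=""):
    """Yield rows coerced to `width` fields (hypothetical helper)."""
    for number, row in enumerate(rows, start=1):
        if len(row) == width:
            yield row
        elif on_mismatch == "resize":
            # Pad short rows with `fill` and truncate long ones.
            yield (row + [fill] * width)[:width]
        elif on_mismatch == "skip":
            continue
        elif on_mismatch == "raise":
            raise ValueError(
                f"row {number} has {len(row)} fields, expected {width}")
        else:
            raise ValueError(f"unknown policy: {on_mismatch!r}")

lines = ["a,b,c", "d,e", "f,g,h,i"]

resized = list(normalize_rows(csv.reader(lines), 3))
skipped = list(normalize_rows(csv.reader(lines), 3, on_mismatch="skip"))

# Consecutive delimiters denote an empty field, as suggested above.
with_empty = next(csv.reader(["a,,b"]))

print(resized)
print(skipped)
print(with_empty)
```

Keeping the policy outside the parser leaves the reader itself simple,
at the cost of every caller having to opt in to length checking.

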
References
==========

.. [1] csv module, Python Sandbox
   (http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/csv/)

.. [2] csv module, Object Craft
   (http://www.object-craft.com.au/projects/csv)

.. [3] Python-DSV module, Wells
   (http://sourceforge.net/projects/python-dsv/)

.. [4] ASV module, Tratt
   (http://tratt.net/laurie/python/asv/)

There are many references to other CSV-related projects on the Web. A
few are included here.


Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   End: