a number of changes to keep more-or-less up-to-date.
This commit is contained in:
parent
1dc70cf85e
commit
67273248cc
398
pep-0305.txt
398
pep-0305.txt
|
@ -18,31 +18,77 @@ The Comma Separated Values (CSV) file format is the most common import
|
|||
and export format for spreadsheets and databases. Although many CSV
|
||||
files are simple to parse, the format is not formally defined by a
|
||||
stable specification and is subtle enough that parsing lines of a CSV
|
||||
file with something like ``line.split(",")`` is bound to fail. This
|
||||
PEP defines an API for reading and writing CSV files which should make
|
||||
it possible for programmers to select a CSV module which meets their
|
||||
requirements. It is accompanied by a corresponding module which
|
||||
implements the API.
|
||||
file with something like ``line.split(",")`` is eventually bound to
|
||||
fail. This PEP defines an API for reading and writing CSV files. It
|
||||
is accompanied by a corresponding module which implements the API.
|
||||
|
||||
|
||||
To Do (Notes for the Interested and Ambitious)
|
||||
==============================================
|
||||
|
||||
- Need to better explain the advantages of a purpose-built csv module
|
||||
over the simple ",".join() and [].split() approach.
|
||||
|
||||
- Need to complete initial list of formatting parameters and settle on
|
||||
names.
|
||||
|
||||
- Better motivation for the choice of passing a file object to the
|
||||
constructors. See http://manatee.mojam.com/pipermail/csv/2003-January/000179.html
|
||||
|
||||
- Unicode. ugh.
|
||||
|
||||
|
||||
Application Domain
|
||||
==================
|
||||
|
||||
This PEP is about doing one thing well: parsing tabular data which may
|
||||
use a variety of field separators, quoting characters, quote escape
|
||||
mechanisms and line endings. The authors intend the proposed module
|
||||
to solve this one parsing problem efficiently. The authors do not
|
||||
intend to address any of these related topics::
|
||||
|
||||
- data interpretation (is a field containing the string "10"
|
||||
supposed to be a string, a float or an int? is it a number in
|
||||
base 10, base 16 or base 2? is a number in quotes a number or a
|
||||
string?)
|
||||
|
||||
- locale-specific data representation (should the number 1.23 be
|
||||
written as "1.23" or "1,23" or "1 23"?) -- this may eventually
|
||||
be addressed.
|
||||
|
||||
- fixed width tabular data - can already be parsed reliably.
|
||||
|
||||
|
||||
Rationale
|
||||
=========
|
||||
|
||||
Often, CSV files are formatted simply enough that you can get by
|
||||
reading them line-by-line and splitting on the commas which delimit
|
||||
the fields. This is especially true if all the data being read is
|
||||
numeric. This approach may work for awhile, then come back to bite
|
||||
you in the butt when somebody puts something unexpected in the data
|
||||
like a comma. As you dig into the problem you may eventually come to
|
||||
the conclusion that you can solve the problem using regular
|
||||
expressions. This will work for awhile, then break mysteriously one
|
||||
day. The problem grows, so you dig deeper and eventually realize that
|
||||
you need a purpose-built parser for the format.
|
||||
|
||||
CSV formats are not well-defined and different implementations have a
|
||||
number of subtle corner cases. It has been suggested that the "V" in
|
||||
the acronym stands for "Vague" instead of "Values". Different
|
||||
delimiters and quoting characters are just the start. Some programs
|
||||
generate whitespace after each delimiter which is not part of the
|
||||
following field. Others quote embedded quoting characters by doubling
|
||||
them, others by prefixing them with an escape character. The list of
|
||||
weird ways to do things can seem endless.
|
||||
|
||||
All this variability means it is difficult for programmers to reliably
|
||||
parse CSV files from many sources or generate CSV files designed to be
|
||||
fed to specific external programs without a thorough understanding of
|
||||
those sources and programs. This PEP and the software which accompany
|
||||
it attempt to make the process less fragile.
|
||||
|
||||
|
||||
Existing Modules
|
||||
================
|
||||
|
||||
Three widely available modules enable programmers to read and write
|
||||
CSV files:
|
||||
This problem has been tackled before. At least three modules
|
||||
currently available in the Python community enable programmers to read
|
||||
and write CSV files:
|
||||
|
||||
- Object Craft's CSV module [2]_
|
||||
|
||||
|
@ -53,73 +99,65 @@ CSV files:
|
|||
Each has a different API, making it somewhat difficult for programmers
|
||||
to switch between them. More of a problem may be that they interpret
|
||||
some of the CSV corner cases differently, so even after surmounting
|
||||
the differences in the module APIs, the programmer has to also deal
|
||||
with semantic differences between the packages.
|
||||
|
||||
|
||||
Rationale
|
||||
=========
|
||||
|
||||
By defining common APIs for reading and writing CSV files, we make it
|
||||
easier for programmers to choose an appropriate module to suit their
|
||||
needs, and make it easier to switch between modules if their needs
|
||||
change. This PEP also forms a set of requirements for creation of a
|
||||
module which will hopefully be incorporated into the Python
|
||||
distribution.
|
||||
|
||||
CSV formats are not well-defined and different implementations have a
|
||||
number of subtle corner cases. It has been suggested that the "V" in
|
||||
the acronym stands for "Vague" instead of "Values". Different
|
||||
delimiters and quoting characters are just the start. Some programs
|
||||
generate whitespace after the delimiter. Others quote embedded
|
||||
quoting characters by doubling them or prefixing them with an escape
|
||||
character. The list of weird ways to do things seems nearly endless.
|
||||
|
||||
Unfortunately, all this variability and subtlety means it is difficult
|
||||
for programmers to reliably parse CSV files from many sources or
|
||||
generate CSV files designed to be fed to specific external programs
|
||||
without deep knowledge of those sources and programs. This PEP and
|
||||
the software which accompany it attempt to make the process less
|
||||
fragile.
|
||||
the differences between the different module APIs, the programmer has
|
||||
to also deal with semantic differences between the packages.
|
||||
|
||||
|
||||
Module Interface
|
||||
================
|
||||
|
||||
The module supports two basic APIs, one for reading and one for
|
||||
writing. The basic reading interface is::
|
||||
This PEP supports three basic APIs, one to read and parse CSV files,
|
||||
one to write them, and one to identify different CSV dialects to the
|
||||
readers and writers.
|
||||
|
||||
|
||||
Reading CSV Files
|
||||
-----------------
|
||||
|
||||
CSV readers are created with the reader factory function::
|
||||
|
||||
obj = reader(iterable [, dialect='excel']
|
||||
[optional keyword args])
|
||||
|
||||
A reader object is an iterable which takes an interable object which
|
||||
returns lines as the sole required parameter. The optional dialect
|
||||
parameter is discussed below. It also accepts several optional
|
||||
keyword arguments which define specific format settings for the parser
|
||||
(see the section "Formatting Parameters"). Readers are typically used
|
||||
as follows::
|
||||
A reader object is an iterator which takes an iterable object
|
||||
returning lines as the sole required parameter. If it supports a
|
||||
binary mode (file objects do), the iterable argument to the reader
|
||||
function must have been opened in binary mode. This gives the reader
|
||||
object full control over the interpretation of the file's contents.
|
||||
The optional dialect parameter is discussed below. The reader
|
||||
function also accepts several optional keyword arguments which define
|
||||
specific format settings for the parser (see the section "Formatting
|
||||
Parameters"). Readers are typically used as follows::
|
||||
|
||||
csvreader = csv.reader(file("some.csv"))
|
||||
for row in csvreader:
|
||||
process(row)
|
||||
|
||||
The writing interface is similar::
|
||||
Each row returned by a reader object is a list of strings or Unicode
|
||||
objects.
|
||||
|
||||
obj = writer(fileobj [, dialect='excel'], [, fieldnames=seq]
|
||||
When both a dialect parameter and individual formatting parameters are
|
||||
passed to the constructor, first the dialect is queried for formatting
|
||||
parameters, then individual formatting parameters are examined.
|
||||
|
||||
|
||||
Writing CSV Files
|
||||
-----------------
|
||||
|
||||
Creating writers is similar::
|
||||
|
||||
obj = writer(fileobj [, dialect='excel'],
|
||||
[optional keyword args])
|
||||
|
||||
A writer object is a wrapper around a file-like object opened for
|
||||
writing. It accepts the same optional keyword parameters as the
|
||||
reader constructor. In addition, it accepts an optional fieldnames
|
||||
argument. This is a sequence that defines the order of fields in the
|
||||
output file. It allows the write() method to accept mapping objects
|
||||
as well as sequence objects.
|
||||
writing in binary mode (if such a distinction is made). It accepts
|
||||
the same optional keyword parameters as the reader constructor.
|
||||
|
||||
Writers are typically used as follows::
|
||||
|
||||
csvwriter = csv.writer(file("some.csv", "w"))
|
||||
for row in someiterable:
|
||||
csvwriter.write(row)
|
||||
csvwriter.writerow(row)
|
||||
|
||||
To generate a set of field names as the first row of the CSV file, the
|
||||
programmer must explicitly write it, e.g.::
|
||||
|
@ -132,111 +170,150 @@ programmer must explicitly write it, e.g.::
|
|||
or arrange for it to be the first row in the iterable being written.
|
||||
|
||||
|
||||
Dialects
|
||||
--------
|
||||
Managing Different Dialects
|
||||
---------------------------
|
||||
|
||||
Readers and writers support a dialect argument which is just a
|
||||
convenient handle on a group of lower level parameters.
|
||||
Because CSV is a somewhat ill-defined format, there are plenty of ways
|
||||
one CSV file can differ from another, yet contain exactly the same
|
||||
data. Many tools which can import or export tabular data allow the
|
||||
user to indicate the field delimiter, quote character, line
|
||||
terminator, and other characteristics of the file. These can be
|
||||
fairly easily determined, but are still mildly annoying to figure out,
|
||||
and make for fairly long function calls when specified individually.
|
||||
|
||||
When dialect is a string it identifies one of the dialects which is
|
||||
known to the module, otherwise it is processed as a dialect class as
|
||||
described below.
|
||||
To try and minimize the difficulty of figuring out and specifying a
|
||||
bunch of formatting parameters, reader and writer objects support a
|
||||
dialect argument which is just a convenient handle on a group of these
|
||||
lower level parameters. When a dialect is given as a string it
|
||||
identifies one of the dialects known to the module via its
|
||||
registration functions, otherwise it must be an instance of the
|
||||
Dialect class as described below.
|
||||
|
||||
Dialects will generally be named after applications or organizations
|
||||
which define specific sets of format constraints. The initial dialect
|
||||
is "excel", which describes the format constraints of Excel 97 and
|
||||
Excel 2000 regarding CSV input and output. Another possible dialect
|
||||
(used here only as an example) might be "gnumeric".
|
||||
which define specific sets of format constraints. Two dialects are
|
||||
defined in the module as of this writing, "excel", which describes the
|
||||
default format constraints for CSV file export by Excel 97 and Excel
|
||||
2000, and "excel-tab", which is the same as "excel" but specifies an
|
||||
ASCII TAB character as the field delimiter.
|
||||
|
||||
Dialects are implemented as attribute only classes to enable users to
|
||||
construct variant dialects by subclassing. The "excel" dialect is
|
||||
implemented as follows::
|
||||
construct variant dialects by subclassing. The "excel" dialect is a
|
||||
subclass of Dialect and is defined as follows::
|
||||
|
||||
class excel:
|
||||
class Dialect:
|
||||
# placeholders
|
||||
delimiter = None
|
||||
quotechar = None
|
||||
escapechar = None
|
||||
doublequote = None
|
||||
skipinitialspace = None
|
||||
lineterminator = None
|
||||
quoting = None
|
||||
|
||||
class excel(Dialect):
|
||||
delimiter = ','
|
||||
quotechar = '"'
|
||||
escapechar = None
|
||||
doublequote = True
|
||||
skipinitialspace = False
|
||||
lineterminator = '\r\n'
|
||||
quoting = QUOTE_MINIMAL
|
||||
|
||||
An excel tab separated dialect can then be defined in user code as
|
||||
follows::
|
||||
The "excel-tab" dialect is defined as::
|
||||
|
||||
class exceltsv(csv.excel):
|
||||
class exceltsv(excel):
|
||||
delimiter = '\t'
|
||||
|
||||
Three functions are defined in the API to set, get and list dialects::
|
||||
(For a description of the individual formatting parameters see the
|
||||
section "Formatting Parameters".)
|
||||
|
||||
To enable string references to specific dialects, the module defines
|
||||
several functions::
|
||||
|
||||
register_dialect(name, dialect)
|
||||
dialect = get_dialect(name)
|
||||
known_dialects = list_dialects()
|
||||
names = list_dialects()
|
||||
register_dialect(name, dialect)
|
||||
unregister_dialect(name)
|
||||
|
||||
The dialect parameter is a class or instance whose attributes are the
|
||||
formatting parameters defined in the next section. The
|
||||
list_dialects() function returns all the registered dialect names as
|
||||
given in previous register_dialect() calls (both predefined and
|
||||
user-defined).
|
||||
``get_dialect()`` returns the dialect instance associated with the
|
||||
given name. ``list_dialects()`` returns a list of all registered
|
||||
dialect names. ``register_dialects()`` associates a string name with
|
||||
a dialect class. ``unregister_dialect()`` deletes a name/dialect
|
||||
association.
|
||||
|
||||
|
||||
Formatting Parameters
|
||||
---------------------
|
||||
|
||||
Both the reader and writer constructors take several specific
|
||||
formatting parameters, specified as keyword parameters.
|
||||
In addition to the dialect argument, both the reader and writer
|
||||
constructors take several specific formatting parameters, specified as
|
||||
keyword parameters. The formatting parameters understood are::
|
||||
|
||||
- ``quotechar`` specifies a one-character string to use as the quoting
|
||||
character. It defaults to '"'. Setting this to None has the same
|
||||
effect as setting quoting to csv.QUOTE_NONE.
|
||||
- ``quotechar`` specifies a one-character string to use as the
|
||||
quoting character. It defaults to '"'. Setting this to None
|
||||
has the same effect as setting quoting to csv.QUOTE_NONE.
|
||||
|
||||
- ``delimiter`` specifies a one-character string to use as the field
|
||||
separator. It defaults to ','.
|
||||
- ``delimiter`` specifies a one-character string to use as the
|
||||
field separator. It defaults to ','.
|
||||
|
||||
- ``escapechar`` specifies a one-character string used to escape the
|
||||
delimiter when quotechar is set to None.
|
||||
- ``escapechar`` specifies a one-character string used to escape
|
||||
the delimiter when quotechar is set to None.
|
||||
|
||||
- ``skipinitialspace`` specifies how to interpret whitespace which
|
||||
immediately follows a delimiter. It defaults to False, which means
|
||||
that whitespace immediately following a delimiter is part of the
|
||||
following field.
|
||||
- ``skipinitialspace`` specifies how to interpret whitespace which
|
||||
immediately follows a delimiter. It defaults to False, which
|
||||
means that whitespace immediately following a delimiter is part
|
||||
of the following field.
|
||||
|
||||
- ``lineterminator`` specifies the character sequence which should
|
||||
terminate rows.
|
||||
- ``lineterminator`` specifies the character sequence which should
|
||||
terminate rows.
|
||||
|
||||
- ``quoting`` controls when quotes should be generated by the
|
||||
writer. It can take on any of the following module constants::
|
||||
- ``quoting`` controls when quotes should be generated by the
|
||||
writer. It can take on any of the following module constants::
|
||||
|
||||
csv.QUOTE_MINIMAL means only when required, for example, when a
|
||||
field contains either the quotechar or the delimiter
|
||||
* csv.QUOTE_MINIMAL means only when required, for example,
|
||||
when a field contains either the quotechar or the delimiter
|
||||
|
||||
csv.QUOTE_ALL means that quotes are always placed around fields.
|
||||
* csv.QUOTE_ALL means that quotes are always placed around
|
||||
fields.
|
||||
|
||||
csv.QUOTE_NONNUMERIC means that quotes are always placed around
|
||||
fields which contain characters other than [+-0-9.].
|
||||
* csv.QUOTE_NONNUMERIC means that quotes are always placed
|
||||
around nonnumeric fields.
|
||||
|
||||
csv.QUOTE_NONE means that quotes are never placed around
|
||||
fields.
|
||||
* csv.QUOTE_NONE means that quotes are never placed around
|
||||
fields.
|
||||
|
||||
- ``doublequote`` controls the handling of quotes inside fields. When
|
||||
True two consecutive quotes are interpreted as one during read, and
|
||||
when writing, each quote is written as two quotes.
|
||||
|
||||
- are there more to come?
|
||||
- ``doublequote`` controls the handling of quotes inside fields.
|
||||
When True two consecutive quotes are interpreted as one during
|
||||
read, and when writing, each quote is written as two quotes.
|
||||
|
||||
When processing a dialect setting and one or more of the other
|
||||
optional parameters, the dialect parameter is processed first, then
|
||||
the others are processed. This makes it easy to choose a dialect,
|
||||
then override one or more of the settings without defining a new
|
||||
dialect class. For example, if a CSV file was generated by Excel 2000
|
||||
using single quotes as the quote character and TAB as the delimiter,
|
||||
you could create a reader like::
|
||||
optional parameters, the dialect parameter is processed before the
|
||||
individual formatting parameters. This makes it easy to choose a
|
||||
dialect, then override one or more of the settings without defining a
|
||||
new dialect class. For example, if a CSV file was generated by Excel
|
||||
2000 using single quotes as the quote character and a colon as the
|
||||
delimiter, you could create a reader like::
|
||||
|
||||
csvreader = csv.reader(file("some.csv"), dialect="excel",
|
||||
quotechar="'", delimiter='\t')
|
||||
quotechar="'", delimiter=':')
|
||||
|
||||
Other details of how Excel generates CSV files would be handled
|
||||
automatically.
|
||||
automatically because of the reference to the "excel" dialect.
|
||||
|
||||
|
||||
Reader Objects
|
||||
--------------
|
||||
|
||||
Reader objects are iterables whose next() method returns a sequence of
|
||||
strings, one string per field in the row.
|
||||
|
||||
|
||||
Writer Objects
|
||||
--------------
|
||||
|
||||
Writer objects have two methods, writerow() and writerows(). The
|
||||
former accepts an iterable (typically a list) of fields which are to
|
||||
be written to the output. The latter accepts a list of iterables and
|
||||
calls writerow() for each.
|
||||
|
||||
|
||||
Implementation
|
||||
|
@ -257,63 +334,54 @@ The sample implementation [1]_ includes a set of test cases.
|
|||
Issues
|
||||
======
|
||||
|
||||
- Should a parameter control how consecutive delimiters are
|
||||
interpreted? Our thought is "no". Consecutive delimiters should
|
||||
always denote an empty field.
|
||||
1. Should a parameter control how consecutive delimiters are
|
||||
interpreted? Our thought is "no". Consecutive delimiters should
|
||||
always denote an empty field.
|
||||
|
||||
- What about Unicode? Is it sufficient to pass a file object gotten
|
||||
from codecs.open()? For example::
|
||||
2. What about Unicode? Is it sufficient to pass a file object gotten
|
||||
from codecs.open()? For example::
|
||||
|
||||
csvreader = csv.reader(codecs.open("some.csv", "r", "cp1252"))
|
||||
csvreader = csv.reader(codecs.open("some.csv", "r", "cp1252"))
|
||||
|
||||
csvwriter = csv.writer(codecs.open("some.csv", "w", "utf-8"))
|
||||
csvwriter = csv.writer(codecs.open("some.csv", "w", "utf-8"))
|
||||
|
||||
In the first example, text would be assumed to be encoded as cp1252.
|
||||
Should the system be aggressive in converting to Unicode or should
|
||||
Unicode strings only be returned if necessary?
|
||||
In the first example, text would be assumed to be encoded as cp1252.
|
||||
Should the system be aggressive in converting to Unicode or should
|
||||
Unicode strings only be returned if necessary?
|
||||
|
||||
In the second example, the file will take care of automatically
|
||||
encoding Unicode strings as utf-8 before writing to disk.
|
||||
In the second example, the file will take care of automatically
|
||||
encoding Unicode strings as utf-8 before writing to disk.
|
||||
|
||||
- What about alternate escape conventions? When Excel exports a file,
|
||||
it appears only the field delimiter needs to be escaped. It
|
||||
accomplishes this by quoting the entire field, then doubling any
|
||||
quote characters which appear in the field. It also quotes a field
|
||||
if the first character is a quote character. It would seem we need
|
||||
to support two modes: escape-by-quoting and escape-by-prefix. In
|
||||
addition, for the second mode, we'd have to specify the escape
|
||||
character (presumably defaulting to a backslash character).
|
||||
Note: As of this writing, the csv module doesn't handle Unicode
|
||||
data.
|
||||
|
||||
- Should there be a "fully quoted" mode for writing? What about
|
||||
"fully quoted except for numeric values"?
|
||||
3. What about alternate escape conventions? If the dialect in use
|
||||
includes an ``escapechar`` parameter which is not None and the
|
||||
``quoting`` parameter is set to QUOTE_NONE, delimiters appearing
|
||||
within fields will be prefixed by the escape character when writing
|
||||
and are expected to be prefixed by the escape character when
|
||||
reading.
|
||||
|
||||
- What about end-of-line? If I generate a CSV file on a Unix system,
|
||||
will Excel properly recognize the LF-only line terminators?
|
||||
4. Should there be a "fully quoted" mode for writing? What about
|
||||
"fully quoted except for numeric values"? Both are implemented
|
||||
(QUOTE_ALL and QUOTE_NONNUMERIC, respectively).
|
||||
|
||||
- What about conversion to other file formats? Is the list-of-lists
|
||||
output from the csvreader sufficient to feed into other writers?
|
||||
5. What about end-of-line? If I generate a CSV file on a Unix system,
|
||||
will Excel properly recognize the LF-only line terminators? Files
|
||||
must be opened for reading or writing as appropriate using binary
|
||||
mode. Specify the ``lineterminator`` sequence as '\r\n'. The
|
||||
resulting file will be written correctly.
|
||||
|
||||
- What about an option to generate list-of-dict output from the reader
|
||||
and accept list-of-dicts by the writer? This makes manipulating
|
||||
individual rows easier since each one is independent, but you lose
|
||||
field order when writing and have to tell the writer object the
|
||||
order the fields should appear in the file.
|
||||
6. What about an option to generate dicts from the reader and accept
|
||||
dicts by the writer? See the DictReader and DictWriter classes in
|
||||
csv.py.
|
||||
|
||||
- Are quote character and delimiters limited to single characters? I
|
||||
had a client not that long ago who wrote their own flat file format
|
||||
with a delimiter of ":::".
|
||||
8. Are quote character and delimiters limited to single characters?
|
||||
For the time being, yes.
|
||||
|
||||
- How should rows of different lengths be handled? The options seem
|
||||
to be:
|
||||
|
||||
* raise an exception when a row is encountered whose length differs
|
||||
from the previous row
|
||||
|
||||
* silently return short rows
|
||||
|
||||
* allow the caller to specify the desired row length and what to do
|
||||
when rows of a different length are encountered: ignore, truncate,
|
||||
pad, raise exception, etc.
|
||||
9. How should rows of different lengths be handled? Interpretation of
|
||||
the data is the application's job. There is no such thing as a
|
||||
"short row" or a "long row" at this level.
|
||||
|
||||
|
||||
References
|
||||
|
|
Loading…
Reference in New Issue