python-peps/pep-0305.txt

PEP: 305
Title: CSV File API
Version: $Revision$
Last-Modified: $Date$
Author: Kevin Altis, Dave Cole, Andrew McNamara, Skip Montanaro, Cliff Wells
Discussions-To: <csv@mail.mojam.com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 26-Jan-2003
Post-History: 31-Jan-2003


Abstract
========

The Comma Separated Values (CSV) file format is the most common import
and export format for spreadsheets and databases.  Although many CSV
files are simple to parse, the format is not formally defined by a
stable specification and is subtle enough that parsing lines of a CSV
file with something like ``line.split(",")`` is bound to fail.  This
PEP defines an API for reading and writing CSV files which should make
it possible for programmers to select a CSV module which meets their
requirements.  It is accompanied by a corresponding module which
implements the API.


To Do (Notes for the Interested and Ambitious)
==============================================

- Need to better explain the advantages of a purpose-built csv module
  over the simple ",".join() and [].split() approach.

- Need to complete initial list of formatting parameters and settle on
  names.

- Better motivation for the choice of passing a file object to the
  constructors.  See http://manatee.mojam.com/pipermail/csv/2003-January/000179.html


Existing Modules
================

Three widely available modules enable programmers to read and write
CSV files:

- Object Craft's CSV module [2]_

- Cliff Wells' Python-DSV module [3]_

- Laurence Tratt's ASV module [4]_

Each has a different API, making it somewhat difficult for programmers
to switch between them.  More of a problem may be that they interpret
some of the CSV corner cases differently, so even after surmounting
the differences in the module APIs, the programmer has to also deal
with semantic differences between the packages.


Rationale
=========

By defining common APIs for reading and writing CSV files, we make it
easier for programmers to choose an appropriate module to suit their
needs, and make it easier to switch between modules if their needs
change.  This PEP also forms a set of requirements for creation of a
module which will hopefully be incorporated into the Python
distribution.

CSV formats are not well-defined and different implementations have a
number of subtle corner cases.  It has been suggested that the "V" in
the acronym stands for "Vague" instead of "Values".  Different
delimiters and quoting characters are just the start.  Some programs
generate whitespace after the delimiter.  Others quote embedded
quoting characters by doubling them or prefixing them with an escape
character.  The list of weird ways to do things seems nearly endless.

Unfortunately, all this variability and subtlety means it is difficult
for programmers to reliably parse CSV files from many sources or
generate CSV files designed to be fed to specific external programs
without deep knowledge of those sources and programs.  This PEP and
the software which accompany it attempt to make the process less
fragile.


Module Interface
================

The module supports two basic APIs, one for reading and one for
writing.  The basic reading interface is::

    obj = reader(iterable [, dialect='excel']
                 [optional keyword args])

A reader object is an iterable which takes an interable object which
returns lines as the sole required parameter.  The optional dialect
parameter is discussed below.  It also accepts several optional
keyword arguments which define specific format settings for the parser
(see the section "Formatting Parameters").  Readers are typically used
as follows::

    csvreader = csv.reader(file("some.csv"))
    for row in csvreader:
        process(row)

The writing interface is similar::

    obj = writer(fileobj [, dialect='excel'], [, fieldnames=seq]
                 [optional keyword args])

A writer object is a wrapper around a file-like object opened for
writing.  It accepts the same optional keyword parameters as the
reader constructor.  In addition, it accepts an optional fieldnames
argument.  This is a sequence that defines the order of fields in the
output file.  It allows the write() method to accept mapping objects
as well as sequence objects.

Writers are typically used as follows::

    csvwriter = csv.writer(file("some.csv", "w"))
    for row in someiterable:
        csvwriter.write(row)

To generate a set of field names as the first row of the CSV file, the
programmer must explicitly write it, e.g.::

    csvwriter = csv.writer(file("some.csv", "w"), fieldnames=names)
    csvwriter.write(names)
    for row in someiterable:
        csvwriter.write(row)

or arrange for it to be the first row in the iterable being written.


Dialects
--------

Readers and writers support a dialect argument which is just a
convenient handle on a group of lower level parameters.

When dialect is a string it identifies one of the dialects which is
known to the module, otherwise it is processed as a dialect class as
described below.

Dialects will generally be named after applications or organizations
which define specific sets of format constraints.  The initial dialect
is "excel", which describes the format constraints of Excel 97 and
Excel 2000 regarding CSV input and output.  Another possible dialect
(used here only as an example) might be "gnumeric".

Dialects are implemented as attribute only classes to enable users to
construct variant dialects by subclassing.  The "excel" dialect is
implemented as follows::

    class excel:
        delimiter = ','
        quotechar = '"'
        escapechar = None
        doublequote = True
        skipinitialspace = False
        lineterminator = '\r\n'
        quoting = QUOTE_MINIMAL

An excel tab separated dialect can then be defined in user code as
follows::

    class exceltsv(csv.excel):
        delimiter = '\t'

Three functions are defined in the API to set, get and list dialects::

    register_dialect(name, dialect)
    dialect = get_dialect(name)
    known_dialects = list_dialects()

The dialect parameter is a class or instance whose attributes are the
formatting parameters defined in the next section.  The
list_dialects() function returns all the registered dialect names as
given in previous register_dialect() calls (both predefined and
user-defined).


Formatting Parameters
---------------------

Both the reader and writer constructors take several specific
formatting parameters, specified as keyword parameters.

- ``quotechar`` specifies a one-character string to use as the quoting
  character.  It defaults to '"'.  Setting this to None has the same
  effect as setting quoting to csv.QUOTE_NONE.

- ``delimiter`` specifies a one-character string to use as the field
  separator.  It defaults to ','.

- ``escapechar`` specifies a one-character string used to escape the
  delimiter when quotechar is set to None.

- ``skipinitialspace`` specifies how to interpret whitespace which
  immediately follows a delimiter.  It defaults to False, which means
  that whitespace immediately following a delimiter is part of the
  following field.

- ``lineterminator`` specifies the character sequence which should
  terminate rows.

- ``quoting`` controls when quotes should be generated by the
  writer.  It can take on any of the following module constants::

    csv.QUOTE_MINIMAL means only when required, for example, when a
    field contains either the quotechar or the delimiter

    csv.QUOTE_ALL means that quotes are always placed around fields.

    csv.QUOTE_NONNUMERIC means that quotes are always placed around
    fields which contain characters other than [+-0-9.].

    csv.QUOTE_NONE means that quotes are never placed around
    fields.

- ``doublequote`` controls the handling of quotes inside fields.  When
  True two consecutive quotes are interpreted as one during read, and
  when writing, each quote is written as two quotes.

- are there more to come?

When processing a dialect setting and one or more of the other
optional parameters, the dialect parameter is processed first, then
the others are processed.  This makes it easy to choose a dialect,
then override one or more of the settings without defining a new
dialect class.  For example, if a CSV file was generated by Excel 2000
using single quotes as the quote character and TAB as the delimiter,
you could create a reader like::

    csvreader = csv.reader(file("some.csv"), dialect="excel",
                           quotechar="'", delimiter='\t')

Other details of how Excel generates CSV files would be handled
automatically.


Implementation
==============

There is a sample implementation available.  [1]_ The goal is for it
to efficiently implement the API described in the PEP.  It is heavily
based on the Object Craft csv module. [2]_


Testing
=======

The sample implementation [1]_ includes a set of test cases.


Issues
======

- Should a parameter control how consecutive delimiters are
  interpreted?  Our thought is "no".  Consecutive delimiters should
  always denote an empty field.

- What about Unicode?  Is it sufficient to pass a file object gotten
  from codecs.open()?  For example::

    csvreader = csv.reader(codecs.open("some.csv", "r", "cp1252"))

    csvwriter = csv.writer(codecs.open("some.csv", "w", "utf-8"))

  In the first example, text would be assumed to be encoded as cp1252.
  Should the system be aggressive in converting to Unicode or should
  Unicode strings only be returned if necessary?

  In the second example, the file will take care of automatically
  encoding Unicode strings as utf-8 before writing to disk.

- What about alternate escape conventions?  When Excel exports a file,
  it appears only the field delimiter needs to be escaped.  It
  accomplishes this by quoting the entire field, then doubling any
  quote characters which appear in the field.  It also quotes a field
  if the first character is a quote character.  It would seem we need
  to support two modes: escape-by-quoting and escape-by-prefix.  In
  addition, for the second mode, we'd have to specify the escape
  character (presumably defaulting to a backslash character).

- Should there be a "fully quoted" mode for writing?  What about
  "fully quoted except for numeric values"?

- What about end-of-line?  If I generate a CSV file on a Unix system,
  will Excel properly recognize the LF-only line terminators?

- What about conversion to other file formats?  Is the list-of-lists
  output from the csvreader sufficient to feed into other writers?

- What about an option to generate list-of-dict output from the reader
  and accept list-of-dicts by the writer?  This makes manipulating
  individual rows easier since each one is independent, but you lose
  field order when writing and have to tell the writer object the
  order the fields should appear in the file.

- Are quote character and delimiters limited to single characters?  I
  had a client not that long ago who wrote their own flat file format
  with a delimiter of ":::".

- How should rows of different lengths be handled?  The options seem
  to be:

  * raise an exception when a row is encountered whose length differs
    from the previous row

  * silently return short rows

  * allow the caller to specify the desired row length and what to do
    when rows of a different length are encountered: ignore, truncate,
    pad, raise exception, etc.


References
==========

.. [1] csv module, Python Sandbox
   (http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/csv/)

.. [2] csv module, Object Craft
   (http://www.object-craft.com.au/projects/csv)

.. [3] Python-DSV module, Wells
   (http://sourceforge.net/projects/python-dsv/)

.. [4] ASV module, Tratt
   (http://tratt.net/laurie/python/asv/)

There are many references to other CSV-related projects on the Web.  A
few are included here.


Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   End:
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
+								PEP: 305
-												capitalized title

											
										
										
											2003-01-28 23:23:26 -05:00
+								Title: CSV File API
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
+								Version: $Revision$
 								Last-Modified: $Date$
-												remove authors' emails, add Discussion-To: to try and encourage feedback to
go to the mailing list.

											
										
										
											2003-01-31 16:55:38 -05:00
+								Author: Kevin Altis, Dave Cole, Andrew McNamara, Skip Montanaro, Cliff Wells
 								Discussions-To: <csv@mail.mojam.com>
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
+								Status: Draft
-												Belatedly added Dave and Andrew as authors.  Changed Type to Standards
Track.

											
										
										
											2003-01-29 08:36:59 -05:00
+								Type: Standards Track
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
+								Content-Type: text/x-rst
 								Created: 26-Jan-2003
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								Post-History: 31-Jan-2003
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
 								Abstract
 								========
 								The Comma Separated Values (CSV) file format is the most common import
 								and export format for spreadsheets and databases.  Although many CSV
 								files are simple to parse, the format is not formally defined by a
 								stable specification and is subtle enough that parsing lines of a CSV
 								file with something like ``line.split(",")`` is bound to fail.  This
 								PEP defines an API for reading and writing CSV files which should make
 								it possible for programmers to select a CSV module which meets their
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								requirements.  It is accompanied by a corresponding module which
 								implements the API.
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
-												add some To Do items (to David Goodger - don't shoot me because the text now
has embedded URLs - they are temporary. ;-).
added [optional keyword args] to the constructor prototypes.

											
										
										
											2003-01-30 07:53:40 -05:00
+								To Do (Notes for the Interested and Ambitious)
 								==============================================
 								- Need to better explain the advantages of a purpose-built csv module
 								  over the simple ",".join() and [].split() approach.
 								- Need to complete initial list of formatting parameters and settle on
 								  names.
 								- Better motivation for the choice of passing a file object to the
 								  constructors.  See http://manatee.mojam.com/pipermail/csv/2003-January/000179.html
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
+								Existing Modules
 								================
 								Three widely available modules enable programmers to read and write
 								CSV files:
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								- Object Craft's CSV module [2]_
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								- Cliff Wells' Python-DSV module [3]_
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								- Laurence Tratt's ASV module [4]_
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
 								Each has a different API, making it somewhat difficult for programmers
 								to switch between them.  More of a problem may be that they interpret
 								some of the CSV corner cases differently, so even after surmounting
 								the differences in the module APIs, the programmer has to also deal
 								with semantic differences between the packages.
 								Rationale
 								=========
 								By defining common APIs for reading and writing CSV files, we make it
 								easier for programmers to choose an appropriate module to suit their
 								needs, and make it easier to switch between modules if their needs
 								change.  This PEP also forms a set of requirements for creation of a
 								module which will hopefully be incorporated into the Python
 								distribution.
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								CSV formats are not well-defined and different implementations have a
 								number of subtle corner cases.  It has been suggested that the "V" in
 								the acronym stands for "Vague" instead of "Values".  Different
 								delimiters and quoting characters are just the start.  Some programs
 								generate whitespace after the delimiter.  Others quote embedded
 								quoting characters by doubling them or prefixing them with an escape
 								character.  The list of weird ways to do things seems nearly endless.
 								Unfortunately, all this variability and subtlety means it is difficult
 								for programmers to reliably parse CSV files from many sources or
 								generate CSV files designed to be fed to specific external programs
 								without deep knowledge of those sources and programs.  This PEP and
 								the software which accompany it attempt to make the process less
 								fragile.
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
 								Module Interface
 								================
 								The module supports two basic APIs, one for reading and one for
-												started reorganizing the information about low-level formatting parameters.
Define dialects in their own subsection.  Define low-level parameters in a
separate subsection.  Define set_dialect() and get_dialect() module-level
functions.

More to be done, but I have to get to work... ;-)

											
										
										
											2003-01-29 09:09:45 -05:00
+								writing.  The basic reading interface is::
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
-												default dialect is now "excel", not "excel2000".

											
										
										
											2003-02-02 21:07:37 -05:00
+								    obj = reader(iterable [, dialect='excel']
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								                 [optional keyword args])
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
-												Changed the csv.reader() fileobj argument to interable.  This give us
much more flexibility in processing filtered data.
Made the example excel dialect match the dialect in csv.py.
Added explanation of doublequote.
Added explanation of csv.QUOTE_NONE.

											
										
										
											2003-02-02 07:25:23 -05:00
+								A reader object is an iterable which takes an interable object which
 								returns lines as the sole required parameter.  The optional dialect
-												add some To Do items (to David Goodger - don't shoot me because the text now
has embedded URLs - they are temporary. ;-).
added [optional keyword args] to the constructor prototypes.

											
										
										
											2003-01-30 07:53:40 -05:00
+								parameter is discussed below.  It also accepts several optional
 								keyword arguments which define specific format settings for the parser
 								(see the section "Formatting Parameters").  Readers are typically used
 								as follows::
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
 								    csvreader = csv.reader(file("some.csv"))
 								    for row in csvreader:
 								        process(row)
 								The writing interface is similar::
-												default dialect is now "excel", not "excel2000".

											
										
										
											2003-02-02 21:07:37 -05:00
+								    obj = writer(fileobj [, dialect='excel'], [, fieldnames=seq]
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								                 [optional keyword args])
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
 								A writer object is a wrapper around a file-like object opened for
-												add some To Do items (to David Goodger - don't shoot me because the text now
has embedded URLs - they are temporary. ;-).
added [optional keyword args] to the constructor prototypes.

											
										
										
											2003-01-30 07:53:40 -05:00
+								writing.  It accepts the same optional keyword parameters as the
 								reader constructor.  In addition, it accepts an optional fieldnames
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								argument.  This is a sequence that defines the order of fields in the
-												started reorganizing the information about low-level formatting parameters.
Define dialects in their own subsection.  Define low-level parameters in a
separate subsection.  Define set_dialect() and get_dialect() module-level
functions.

More to be done, but I have to get to work... ;-)

											
										
										
											2003-01-29 09:09:45 -05:00
+								output file.  It allows the write() method to accept mapping objects
 								as well as sequence objects.
 								Writers are typically used as follows::
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
 								    csvwriter = csv.writer(file("some.csv", "w"))
 								    for row in someiterable:
 								        csvwriter.write(row)
-												started reorganizing the information about low-level formatting parameters.
Define dialects in their own subsection.  Define low-level parameters in a
separate subsection.  Define set_dialect() and get_dialect() module-level
functions.

More to be done, but I have to get to work... ;-)

											
										
										
											2003-01-29 09:09:45 -05:00
+								To generate a set of field names as the first row of the CSV file, the
 								programmer must explicitly write it, e.g.::
 								    csvwriter = csv.writer(file("some.csv", "w"), fieldnames=names)
 								    csvwriter.write(names)
 								    for row in someiterable:
 								        csvwriter.write(row)
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								or arrange for it to be the first row in the iterable being written.
-												started reorganizing the information about low-level formatting parameters.
Define dialects in their own subsection.  Define low-level parameters in a
separate subsection.  Define set_dialect() and get_dialect() module-level
functions.

More to be done, but I have to get to work... ;-)

											
										
										
											2003-01-29 09:09:45 -05:00
 								Dialects
 								--------
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
-												started reorganizing the information about low-level formatting parameters.
Define dialects in their own subsection.  Define low-level parameters in a
separate subsection.  Define set_dialect() and get_dialect() module-level
functions.

More to be done, but I have to get to work... ;-)

											
										
										
											2003-01-29 09:09:45 -05:00
+								Readers and writers support a dialect argument which is just a
-												Trying to bring PEP up to date with discussions on mailing list.  I hope I
have not misinterpreted the conclusions.
* dialect argument is now either a string to identify one of the internally
  defined parameter sets, otherwise it is an object which contains
  attributes which correspond to the parameter set.
* Altered set_dialect() to take dialect name and dialect object.
* Altered get_dialect() to take dialect name and return dialect object.
* Fleshed out formatting parameters, adding escapechar, lineterminator,
  quoting.

											
										
										
											2003-01-30 07:11:27 -05:00
+								convenient handle on a group of lower level parameters.
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								When dialect is a string it identifies one of the dialects which is
-												Trying to bring PEP up to date with discussions on mailing list.  I hope I
have not misinterpreted the conclusions.
* dialect argument is now either a string to identify one of the internally
  defined parameter sets, otherwise it is an object which contains
  attributes which correspond to the parameter set.
* Altered set_dialect() to take dialect name and dialect object.
* Altered get_dialect() to take dialect name and return dialect object.
* Fleshed out formatting parameters, adding escapechar, lineterminator,
  quoting.

											
										
										
											2003-01-30 07:11:27 -05:00
+								known to the module, otherwise it is processed as a dialect class as
 								described below.
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
-												started reorganizing the information about low-level formatting parameters.
Define dialects in their own subsection.  Define low-level parameters in a
separate subsection.  Define set_dialect() and get_dialect() module-level
functions.

More to be done, but I have to get to work... ;-)

											
										
										
											2003-01-29 09:09:45 -05:00
+								Dialects will generally be named after applications or organizations
 								which define specific sets of format constraints.  The initial dialect
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								is "excel", which describes the format constraints of Excel 97 and
 								Excel 2000 regarding CSV input and output.  Another possible dialect
 								(used here only as an example) might be "gnumeric".
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								Dialects are implemented as attribute only classes to enable users to
 								construct variant dialects by subclassing.  The "excel" dialect is
-												Trying to bring PEP up to date with discussions on mailing list.  I hope I
have not misinterpreted the conclusions.
* dialect argument is now either a string to identify one of the internally
  defined parameter sets, otherwise it is an object which contains
  attributes which correspond to the parameter set.
* Altered set_dialect() to take dialect name and dialect object.
* Altered get_dialect() to take dialect name and return dialect object.
* Fleshed out formatting parameters, adding escapechar, lineterminator,
  quoting.

											
										
										
											2003-01-30 07:11:27 -05:00
+								implemented as follows::
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								    class excel:
-												Trying to bring PEP up to date with discussions on mailing list.  I hope I
have not misinterpreted the conclusions.
* dialect argument is now either a string to identify one of the internally
  defined parameter sets, otherwise it is an object which contains
  attributes which correspond to the parameter set.
* Altered set_dialect() to take dialect name and dialect object.
* Altered get_dialect() to take dialect name and return dialect object.
* Fleshed out formatting parameters, adding escapechar, lineterminator,
  quoting.

											
										
										
											2003-01-30 07:11:27 -05:00
+								        delimiter = ','
-												Changed the csv.reader() fileobj argument to interable.  This give us
much more flexibility in processing filtered data.
Made the example excel dialect match the dialect in csv.py.
Added explanation of doublequote.
Added explanation of csv.QUOTE_NONE.

											
										
										
											2003-02-02 07:25:23 -05:00
+								        quotechar = '"'
-												Trying to bring PEP up to date with discussions on mailing list.  I hope I
have not misinterpreted the conclusions.
* dialect argument is now either a string to identify one of the internally
  defined parameter sets, otherwise it is an object which contains
  attributes which correspond to the parameter set.
* Altered set_dialect() to take dialect name and dialect object.
* Altered get_dialect() to take dialect name and return dialect object.
* Fleshed out formatting parameters, adding escapechar, lineterminator,
  quoting.

											
										
										
											2003-01-30 07:11:27 -05:00
+								        escapechar = None
-												Changed the csv.reader() fileobj argument to interable.  This give us
much more flexibility in processing filtered data.
Made the example excel dialect match the dialect in csv.py.
Added explanation of doublequote.
Added explanation of csv.QUOTE_NONE.

											
										
										
											2003-02-02 07:25:23 -05:00
+								        doublequote = True
-												Trying to bring PEP up to date with discussions on mailing list.  I hope I
have not misinterpreted the conclusions.
* dialect argument is now either a string to identify one of the internally
  defined parameter sets, otherwise it is an object which contains
  attributes which correspond to the parameter set.
* Altered set_dialect() to take dialect name and dialect object.
* Altered get_dialect() to take dialect name and return dialect object.
* Fleshed out formatting parameters, adding escapechar, lineterminator,
  quoting.

											
										
										
											2003-01-30 07:11:27 -05:00
+								        skipinitialspace = False
 								        lineterminator = '\r\n'
-												Use symbolic constants for quoting parameter rather than a string.

											
										
										
											2003-01-30 08:34:29 -05:00
+								        quoting = QUOTE_MINIMAL
-												Trying to bring PEP up to date with discussions on mailing list.  I hope I
have not misinterpreted the conclusions.
* dialect argument is now either a string to identify one of the internally
  defined parameter sets, otherwise it is an object which contains
  attributes which correspond to the parameter set.
* Altered set_dialect() to take dialect name and dialect object.
* Altered get_dialect() to take dialect name and return dialect object.
* Fleshed out formatting parameters, adding escapechar, lineterminator,
  quoting.

											
										
										
											2003-01-30 07:11:27 -05:00
 								An excel tab separated dialect can then be defined in user code as
 								follows::
-												default dialect is now "excel", not "excel2000".

											
										
										
											2003-02-02 21:07:37 -05:00
+								    class exceltsv(csv.excel):
-												Trying to bring PEP up to date with discussions on mailing list.  I hope I
have not misinterpreted the conclusions.
* dialect argument is now either a string to identify one of the internally
  defined parameter sets, otherwise it is an object which contains
  attributes which correspond to the parameter set.
* Altered set_dialect() to take dialect name and dialect object.
* Altered get_dialect() to take dialect name and return dialect object.
* Fleshed out formatting parameters, adding escapechar, lineterminator,
  quoting.

											
										
										
											2003-01-30 07:11:27 -05:00
+								        delimiter = '\t'
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								Three functions are defined in the API to set, get and list dialects::
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
-												set_dialect is the wrong name.  It conveys the notion of a single dialect.
register_dialect is better.

											
										
										
											2003-02-02 21:25:26 -05:00
+								    register_dialect(name, dialect)
-												Trying to bring PEP up to date with discussions on mailing list.  I hope I
have not misinterpreted the conclusions.
* dialect argument is now either a string to identify one of the internally
  defined parameter sets, otherwise it is an object which contains
  attributes which correspond to the parameter set.
* Altered set_dialect() to take dialect name and dialect object.
* Altered get_dialect() to take dialect name and return dialect object.
* Fleshed out formatting parameters, adding escapechar, lineterminator,
  quoting.

											
										
										
											2003-01-30 07:11:27 -05:00
+								    dialect = get_dialect(name)
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								    known_dialects = list_dialects()
-												started reorganizing the information about low-level formatting parameters.
Define dialects in their own subsection.  Define low-level parameters in a
separate subsection.  Define set_dialect() and get_dialect() module-level
functions.

More to be done, but I have to get to work... ;-)

											
										
										
											2003-01-29 09:09:45 -05:00
-												Trying to bring PEP up to date with discussions on mailing list.  I hope I
have not misinterpreted the conclusions.
* dialect argument is now either a string to identify one of the internally
  defined parameter sets, otherwise it is an object which contains
  attributes which correspond to the parameter set.
* Altered set_dialect() to take dialect name and dialect object.
* Altered get_dialect() to take dialect name and return dialect object.
* Fleshed out formatting parameters, adding escapechar, lineterminator,
  quoting.

											
										
										
											2003-01-30 07:11:27 -05:00
+								The dialect parameter is a class or instance whose attributes are the
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								formatting parameters defined in the next section.  The
 								list_dialects() function returns all the registered dialect names as
-												set_dialect is the wrong name.  It conveys the notion of a single dialect.
register_dialect is better.

											
										
										
											2003-02-02 21:25:26 -05:00
+								given in previous register_dialect() calls (both predefined and
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								user-defined).
-												started reorganizing the information about low-level formatting parameters.
Define dialects in their own subsection.  Define low-level parameters in a
separate subsection.  Define set_dialect() and get_dialect() module-level
functions.

More to be done, but I have to get to work... ;-)

											
										
										
											2003-01-29 09:09:45 -05:00
 								Formatting Parameters
 								---------------------
 								Both the reader and writer constructors take several specific
-												zap incorrect statement about formatting parameters and the dialect
registry.

											
										
										
											2003-02-02 22:01:48 -05:00
+								formatting parameters, specified as keyword parameters.
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								- ``quotechar`` specifies a one-character string to use as the quoting
-												Changed the csv.reader() fileobj argument to interable.  This give us
much more flexibility in processing filtered data.
Made the example excel dialect match the dialect in csv.py.
Added explanation of doublequote.
Added explanation of csv.QUOTE_NONE.

											
										
										
											2003-02-02 07:25:23 -05:00
+								  character.  It defaults to '"'.  Setting this to None has the same
 								  effect as setting quoting to csv.QUOTE_NONE.
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								- ``delimiter`` specifies a one-character string to use as the field
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
+								  separator.  It defaults to ','.
-												Changed the csv.reader() fileobj argument to interable.  This give us
much more flexibility in processing filtered data.
Made the example excel dialect match the dialect in csv.py.
Added explanation of doublequote.
Added explanation of csv.QUOTE_NONE.

											
										
										
											2003-02-02 07:25:23 -05:00
+								- ``escapechar`` specifies a one-character string used to escape the
-												Trying to bring PEP up to date with discussions on mailing list.  I hope I
have not misinterpreted the conclusions.
* dialect argument is now either a string to identify one of the internally
  defined parameter sets, otherwise it is an object which contains
  attributes which correspond to the parameter set.
* Altered set_dialect() to take dialect name and dialect object.
* Altered get_dialect() to take dialect name and return dialect object.
* Fleshed out formatting parameters, adding escapechar, lineterminator,
  quoting.

											
										
										
											2003-01-30 07:11:27 -05:00
+								  delimiter when quotechar is set to None.
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								- ``skipinitialspace`` specifies how to interpret whitespace which
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
+								  immediately follows a delimiter.  It defaults to False, which means
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								  that whitespace immediately following a delimiter is part of the
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
+								  following field.
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								- ``lineterminator`` specifies the character sequence which should
-												started reorganizing the information about low-level formatting parameters.
Define dialects in their own subsection.  Define low-level parameters in a
separate subsection.  Define set_dialect() and get_dialect() module-level
functions.

More to be done, but I have to get to work... ;-)

											
										
										
											2003-01-29 09:09:45 -05:00
+								  terminate rows.
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								- ``quoting`` controls when quotes should be generated by the
 								  writer.  It can take on any of the following module constants::
 								    csv.QUOTE_MINIMAL means only when required, for example, when a
 								    field contains either the quotechar or the delimiter
-												Trying to bring PEP up to date with discussions on mailing list.  I hope I
have not misinterpreted the conclusions.
* dialect argument is now either a string to identify one of the internally
  defined parameter sets, otherwise it is an object which contains
  attributes which correspond to the parameter set.
* Altered set_dialect() to take dialect name and dialect object.
* Altered get_dialect() to take dialect name and return dialect object.
* Fleshed out formatting parameters, adding escapechar, lineterminator,
  quoting.

											
										
										
											2003-01-30 07:11:27 -05:00
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								    csv.QUOTE_ALL means that quotes are always placed around fields.
-												Trying to bring PEP up to date with discussions on mailing list.  I hope I
have not misinterpreted the conclusions.
* dialect argument is now either a string to identify one of the internally
  defined parameter sets, otherwise it is an object which contains
  attributes which correspond to the parameter set.
* Altered set_dialect() to take dialect name and dialect object.
* Altered get_dialect() to take dialect name and return dialect object.
* Fleshed out formatting parameters, adding escapechar, lineterminator,
  quoting.

											
										
										
											2003-01-30 07:11:27 -05:00
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								    csv.QUOTE_NONNUMERIC means that quotes are always placed around
 								    fields which contain characters other than [+-0-9.].
-												Trying to bring PEP up to date with discussions on mailing list.  I hope I
have not misinterpreted the conclusions.
* dialect argument is now either a string to identify one of the internally
  defined parameter sets, otherwise it is an object which contains
  attributes which correspond to the parameter set.
* Altered set_dialect() to take dialect name and dialect object.
* Altered get_dialect() to take dialect name and return dialect object.
* Fleshed out formatting parameters, adding escapechar, lineterminator,
  quoting.

											
										
										
											2003-01-30 07:11:27 -05:00
-												Changed the csv.reader() fileobj argument to interable.  This give us
much more flexibility in processing filtered data.
Made the example excel dialect match the dialect in csv.py.
Added explanation of doublequote.
Added explanation of csv.QUOTE_NONE.

											
										
										
											2003-02-02 07:25:23 -05:00
+								    csv.QUOTE_NONE means that quotes are never placed around
 								    fields.
 								- ``doublequote`` controls the handling of quotes inside fields.  When
 								  True two consecutive quotes are interpreted as one during read, and
 								  when writing, each quote is written as two quotes.
-												Trying to bring PEP up to date with discussions on mailing list.  I hope I
have not misinterpreted the conclusions.
* dialect argument is now either a string to identify one of the internally
  defined parameter sets, otherwise it is an object which contains
  attributes which correspond to the parameter set.
* Altered set_dialect() to take dialect name and dialect object.
* Altered get_dialect() to take dialect name and return dialect object.
* Fleshed out formatting parameters, adding escapechar, lineterminator,
  quoting.

											
										
										
											2003-01-30 07:11:27 -05:00
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								- are there more to come?
-												started reorganizing the information about low-level formatting parameters.
Define dialects in their own subsection.  Define low-level parameters in a
separate subsection.  Define set_dialect() and get_dialect() module-level
functions.

More to be done, but I have to get to work... ;-)

											
										
										
											2003-01-29 09:09:45 -05:00
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
+								When processing a dialect setting and one or more of the other
 								optional parameters, the dialect parameter is processed first, then
 								the others are processed.  This makes it easy to choose a dialect,
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								then override one or more of the settings without defining a new
 								dialect class.  For example, if a CSV file was generated by Excel 2000
 								using single quotes as the quote character and TAB as the delimiter,
 								you could create a reader like::
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								    csvreader = csv.reader(file("some.csv"), dialect="excel",
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
+								                           quotechar="'", delimiter='\t')
 								Other details of how Excel generates CSV files would be handled
 								automatically.
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								Implementation
 								==============
 								There is a sample implementation available.  [1]_ The goal is for it
 								to efficiently implement the API described in the PEP.  It is heavily
 								based on the Object Craft csv module. [2]_
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
+								Testing
 								=======
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								The sample implementation [1]_ includes a set of test cases.
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
 								Issues
 								======
 								- Should a parameter control how consecutive delimiters are
 								  interpreted?  Our thought is "no".  Consecutive delimiters should
 								  always denote an empty field.
 								- What about Unicode?  Is it sufficient to pass a file object gotten
 								  from codecs.open()?  For example::
 								    csvreader = csv.reader(codecs.open("some.csv", "r", "cp1252"))
 								    csvwriter = csv.writer(codecs.open("some.csv", "w", "utf-8"))
 								  In the first example, text would be assumed to be encoded as cp1252.
 								  Should the system be aggressive in converting to Unicode or should
 								  Unicode strings only be returned if necessary?
 								  In the second example, the file will take care of automatically
 								  encoding Unicode strings as utf-8 before writing to disk.
 								- What about alternate escape conventions?  When Excel exports a file,
 								  it appears only the field delimiter needs to be escaped.  It
 								  accomplishes this by quoting the entire field, then doubling any
 								  quote characters which appear in the field.  It also quotes a field
 								  if the first character is a quote character.  It would seem we need
 								  to support two modes: escape-by-quoting and escape-by-prefix.  In
 								  addition, for the second mode, we'd have to specify the escape
 								  character (presumably defaulting to a backslash character).
 								- Should there be a "fully quoted" mode for writing?  What about
 								  "fully quoted except for numeric values"?
 								- What about end-of-line?  If I generate a CSV file on a Unix system,
 								  will Excel properly recognize the LF-only line terminators?
 								- What about conversion to other file formats?  Is the list-of-lists
 								  output from the csvreader sufficient to feed into other writers?
 								- What about an option to generate list-of-dict output from the reader
 								  and accept list-of-dicts by the writer?  This makes manipulating
 								  individual rows easier since each one is independent, but you lose
 								  field order when writing and have to tell the writer object the
 								  order the fields should appear in the file.
 								- Are quote character and delimiters limited to single characters?  I
 								  had a client not that long ago who wrote their own flat file format
 								  with a delimiter of ":::".
 								- How should rows of different lengths be handled?  The options seem
 								  to be:
 								  * raise an exception when a row is encountered whose length differs
 								    from the previous row
 								  * silently return short rows
 								  * allow the caller to specify the desired row length and what to do
 								    when rows of a different length are encountered: ignore, truncate,
 								    pad, raise exception, etc.
 								References
 								==========
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								.. [1] csv module, Python Sandbox
 								   (http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/csv/)
 								.. [2] csv module, Object Craft
 								   (http://www.object-craft.com.au/projects/csv)
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								.. [3] Python-DSV module, Wells
 								   (http://sourceforge.net/projects/python-dsv/)
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
-												various cleanups
expanded Rationale a tad
added Post-History date (announcing it in a moment)
added pointer to sandbox implementation
mentioned implementation in the (massive ;-) Testing section

											
										
										
											2003-01-31 16:49:32 -05:00
+								.. [4] ASV module, Tratt
-												Added PEP 305, CSV file API

											
										
										
											2003-01-28 23:20:19 -05:00
+								   (http://tratt.net/laurie/python/asv/)
 								There are many references to other CSV-related projects on the Web.  A
 								few are included here.
 								Copyright
 								=========
 								This document has been placed in the public domain.
 								..
 								   Local Variables:
 								   mode: indented-text
 								   indent-tabs-mode: nil
 								   sentence-end-double-space: t
 								   fill-column: 70
 								   End: