PEP: 305
Title: CSV File API
Version: $Revision$
Last-Modified: $Date$
Author: Skip Montanaro <skip@pobox.com>,
        Kevin Altis <altis@semi-retired.com>,
        Cliff Wells <LogiplexSoftware@earthlink.net>,
        Dave Cole <djc@object-craft.com.au>,
        Andrew McNamara <andrewm@object-craft.com.au>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 26-Jan-2003
Post-History:


Abstract
========

The Comma Separated Values (CSV) file format is the most common import
and export format for spreadsheets and databases.  Although many CSV
files are simple to parse, the format is not formally defined by a
stable specification and is subtle enough that parsing lines of a CSV
file with something like ``line.split(",")`` is bound to fail.  This
PEP defines an API for reading and writing CSV files which should make
it possible for programmers to select a CSV module which meets their
requirements.
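
As a small illustration of the failure mode mentioned above, the
sketch below splits a single record on commas.  The sample record is
invented for this example; it holds three logical fields, one of which
is quoted because it contains a comma and one of which contains
doubled quote characters::

    # Three logical fields: an id, a quoted name containing a comma,
    # and a quoted remark containing doubled quote characters.
    line = '1,"Smith, John","said ""hi"""'

    naive_fields = line.split(",")

    # The embedded comma is treated as a separator and the quote
    # characters are left in the data, so four pieces come back.
    print(len(naive_fields))
    print(naive_fields)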


To Do (Notes for the Interested and Ambitious)
==============================================

- Need to better explain the advantages of a purpose-built csv module
  over the simple ",".join(row) and line.split(",") approach.

- Need to complete initial list of formatting parameters and settle on
  names.

- Better motivation for the choice of passing a file object to the
  constructors.  See http://manatee.mojam.com/pipermail/csv/2003-January/000179.html


Existing Modules
================

Three widely available modules enable programmers to read and write
CSV files:

- Object Craft's CSV module [1]_

- Cliff Wells's Python-DSV module [2]_

- Laurence Tratt's ASV module [3]_

Each has a different API, making it somewhat difficult for programmers
to switch between them.  More of a problem may be that they interpret
some of the CSV corner cases differently, so even after surmounting
the differences in the module APIs, the programmer has to also deal
with semantic differences between the packages.


Rationale
=========

By defining common APIs for reading and writing CSV files, we make it
easier for programmers to choose an appropriate module to suit their
needs, and make it easier to switch between modules if their needs
change.  This PEP also forms a set of requirements for creation of a
module which will hopefully be incorporated into the Python
distribution.


Module Interface
================

The module supports two basic APIs, one for reading and one for
writing.  The basic reading interface is::

    reader(fileobj [, dialect='excel2000'] [optional keyword args])

A reader object is an iterable which takes a file-like object opened
for reading as the sole required parameter.  The optional dialect
parameter is discussed below.  It also accepts several optional
keyword arguments which define specific format settings for the parser
(see the section "Formatting Parameters").  Readers are typically used
as follows::

    csvreader = csv.reader(file("some.csv"))
    for row in csvreader:
        process(row)
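
To give a feel for the work a reader does for each row, here is a
deliberately simplified sketch of splitting one record with
Excel-style quoting.  The function name parse_line is hypothetical and
not part of the proposed API, and the sketch assumes one record per
line (no line breaks inside fields) and ignores escapechar and
skipinitialspace::

    def parse_line(line, delimiter=",", quotechar='"'):
        """Split one record into fields, honouring quoted fields."""
        fields, current, in_quotes = [], [], False
        i = 0
        while i < len(line):
            ch = line[i]
            if in_quotes:
                if ch == quotechar:
                    if line[i + 1:i + 2] == quotechar:
                        # A doubled quote stands for one literal quote.
                        current.append(quotechar)
                        i += 1
                    else:
                        in_quotes = False
                else:
                    current.append(ch)
            elif ch == quotechar:
                in_quotes = True
            elif ch == delimiter:
                fields.append("".join(current))
                current = []
            else:
                current.append(ch)
            i += 1
        fields.append("".join(current))
        return fields


    print(parse_line('1,"Smith, John","said ""hi"""'))
    print(parse_line('a,,b'))

Note that consecutive delimiters produce an empty field in this
sketch.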

The writing interface is similar::

    writer(fileobj [, dialect='excel2000'] [, fieldnames=list]
           [optional keyword args])

A writer object is a wrapper around a file-like object opened for
writing.  It accepts the same optional keyword parameters as the
reader constructor.  In addition, it accepts an optional fieldnames
argument.  This is a list which defines the order of fields in the
output file.  It allows the write() method to accept mapping objects
as well as sequence objects.

Writers are typically used as follows::

    csvwriter = csv.writer(file("some.csv", "w"))
    for row in someiterable:
        csvwriter.write(row)

To generate a set of field names as the first row of the CSV file, the
programmer must explicitly write it, e.g.::

    csvwriter = csv.writer(file("some.csv", "w"), fieldnames=names)
    csvwriter.write(names)
    for row in someiterable:
        csvwriter.write(row)
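
One way to picture how fieldnames lets write() accept mapping objects
is sketched below.  The helper name row_to_sequence is hypothetical
and not part of the proposed API, and the treatment of a missing key
(an empty string) is an assumption which this PEP does not specify::

    def row_to_sequence(row, fieldnames=None):
        """Return the row's values as a list in output order.

        Hypothetical helper: mappings are ordered by fieldnames,
        sequences are passed through unchanged.
        """
        if hasattr(row, "keys"):
            if fieldnames is None:
                raise ValueError("fieldnames is required for mappings")
            # Assumption: a key missing from the mapping becomes "".
            return [row.get(name, "") for name in fieldnames]
        return list(row)


    names = ["id", "name"]
    print(row_to_sequence({"name": "Ann", "id": 1}, names))
    print(row_to_sequence((2, "Bob"), names))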


Dialects
--------

Readers and writers support a dialect argument which is just a
convenient handle on a group of lower level parameters.

When dialect is a string, it identifies one of the dialects known to
the module; otherwise it is processed as a dialect class as described
below.

Dialects will generally be named after applications or organizations
which define specific sets of format constraints.  The initial dialect
is excel2000, which describes the format constraints of Excel 2000's
CSV format.  Another possible dialect (used here only as an example)
might be "gnumeric".

Dialects are implemented as attribute-only classes to enable users to
construct variant dialects by subclassing.  The excel2000 dialect is
implemented as follows::

    class excel2000:
        quotechar = '"'
        delimiter = ','
        escapechar = None
        skipinitialspace = False
        lineterminator = '\r\n'
        quoting = QUOTE_MINIMAL

An excel tab separated dialect can then be defined in user code as
follows::

    class exceltsv(csv.excel2000):
        delimiter = '\t'

Two functions are defined in the API to set and retrieve dialects::

    set_dialect(name, dialect)
    dialect = get_dialect(name)

The dialect parameter is a class or instance whose attributes are the
formatting parameters defined in the next section.
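
A minimal sketch of how these two functions could be backed by a
module-level registry follows.  The _dialects dictionary and the
exception raised for an unknown name are assumptions made for
illustration; the PEP does not specify either.  QUOTE_MINIMAL is given
a placeholder value here::

    QUOTE_MINIMAL = "minimal"  # placeholder value for this sketch


    class excel2000:
        quotechar = '"'
        delimiter = ','
        escapechar = None
        skipinitialspace = False
        lineterminator = '\r\n'
        quoting = QUOTE_MINIMAL


    class exceltsv(excel2000):
        delimiter = '\t'


    _dialects = {}  # hypothetical registry: name -> dialect


    def set_dialect(name, dialect):
        """Register dialect (a class or instance) under name."""
        _dialects[name] = dialect


    def get_dialect(name):
        """Return the dialect registered under name."""
        try:
            return _dialects[name]
        except KeyError:
            # Assumption: unknown names are reported with ValueError.
            raise ValueError("unknown dialect: %r" % (name,))


    set_dialect("excel2000", excel2000)
    set_dialect("exceltsv", exceltsv)
    print(repr(get_dialect("exceltsv").delimiter))
    print(repr(get_dialect("exceltsv").quotechar))

The subclass overrides only the delimiter; every other attribute is
inherited from excel2000.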


Formatting Parameters
---------------------

Both the reader and writer constructors take several specific
formatting parameters, specified as keyword parameters.  The same
names are used for the attributes of the dialect objects passed to
set_dialect() and returned by get_dialect().

- quotechar specifies a one-character string to use as the quoting
  character.  It defaults to '"'.

- delimiter specifies a one-character string to use as the field
  separator.  It defaults to ','.

- escapechar specifies a one-character string used to escape the
  delimiter when quotechar is set to None.

- skipinitialspace specifies how to interpret whitespace which
  immediately follows a delimiter.  It defaults to False, which means
  that whitespace immediately following a delimiter is part of the
  following field.

- lineterminator specifies the character sequence which should
  terminate rows.

- quoting controls when quotes should be generated by the
  writer.

  * "minimal" means only when required, for example, when a field
    contains either the quotechar or the delimiter.

  * "always" means that quotes are always placed around fields.

  * "nonnumeric" means that quotes are always placed around fields
    which contain characters other than [+-0-9.].

... XXX More to come XXX ...
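
The three quoting modes listed above can be sketched as a single
function.  This is only an illustration under stated assumptions: the
name quote_field is hypothetical, embedded quote characters are
doubled in the way the "Issues" section describes for Excel, a line
break inside a field is treated as requiring quotes, and the
"nonnumeric" test reads the character class as the literal characters
plus, minus, period and the digits::

    NUMERIC_CHARS = set("+-0123456789.")


    def quote_field(field, quoting="minimal", quotechar='"',
                    delimiter=","):
        """Return field as text, quoted according to quoting."""
        text = str(field)
        if quoting == "always":
            needs_quotes = True
        elif quoting == "nonnumeric":
            needs_quotes = any(ch not in NUMERIC_CHARS for ch in text)
        elif quoting == "minimal":
            # Assumption: line breaks also force quoting.
            special = (quotechar, delimiter, "\r", "\n")
            needs_quotes = any(ch in text for ch in special)
        else:
            raise ValueError("unknown quoting mode: %r" % (quoting,))
        if not needs_quotes:
            return text
        # Double any embedded quote characters, then wrap the field.
        escaped = text.replace(quotechar, quotechar * 2)
        return quotechar + escaped + quotechar


    print(quote_field("a,b"))
    print(quote_field("plain"))
    print(quote_field(3.5, "nonnumeric"))
    print(quote_field("x1", "nonnumeric"))
    print(quote_field('say "hi"', "always"))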

When processing a dialect setting and one or more of the other
optional parameters, the dialect parameter is processed first, then
the others are processed.  This makes it easy to choose a dialect,
then override one or more of the settings.  For example, if a CSV file
was generated by Excel 2000 using single quotes as the quote
character and TAB as the delimiter, you could create a reader like::

    csvreader = csv.reader(file("some.csv"), dialect="excel2000",
                           quotechar="'", delimiter='\t')

Other details of how Excel generates CSV files would be handled
automatically.
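
The processing order described above (dialect first, then individual
overrides) could look roughly like the following.  The function name
resolve_format, the PARAMETER_NAMES tuple and the stand-in dialect
class are hypothetical; resolving a dialect given by name is omitted
from this sketch::

    PARAMETER_NAMES = ("quotechar", "delimiter", "escapechar",
                       "skipinitialspace", "lineterminator", "quoting")


    class excel2000:
        # Stand-in for the dialect class shown earlier; the quoting
        # value is a placeholder for QUOTE_MINIMAL.
        quotechar = '"'
        delimiter = ','
        escapechar = None
        skipinitialspace = False
        lineterminator = '\r\n'
        quoting = "minimal"


    def resolve_format(dialect, **overrides):
        """Return the effective formatting parameters as a dict."""
        unknown = set(overrides) - set(PARAMETER_NAMES)
        if unknown:
            raise TypeError("unexpected parameters: %s"
                            % sorted(unknown))
        # Step 1: start from the dialect's attributes.
        params = {name: getattr(dialect, name)
                  for name in PARAMETER_NAMES}
        # Step 2: explicit keyword arguments take precedence.
        params.update(overrides)
        return params


    params = resolve_format(excel2000, quotechar="'", delimiter='\t')
    print(params["quotechar"], repr(params["delimiter"]))
    print(repr(params["lineterminator"]))

Settings that are not overridden, such as lineterminator, keep the
values supplied by the dialect.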


Testing
=======

TBD.


Issues
======

- Should a parameter control how consecutive delimiters are
  interpreted?  Our thought is "no".  Consecutive delimiters should
  always denote an empty field.

- What about Unicode?  Is it sufficient to pass a file object gotten
  from codecs.open()?  For example::

    csvreader = csv.reader(codecs.open("some.csv", "r", "cp1252"))

    csvwriter = csv.writer(codecs.open("some.csv", "w", "utf-8"))

  In the first example, text would be assumed to be encoded as cp1252.
  Should the system be aggressive in converting to Unicode or should
  Unicode strings only be returned if necessary?

  In the second example, the file will take care of automatically
  encoding Unicode strings as utf-8 before writing to disk.

- What about alternate escape conventions?  When Excel exports a file,
  it appears only the field delimiter needs to be escaped.  It
  accomplishes this by quoting the entire field, then doubling any
  quote characters which appear in the field.  It also quotes a field
  if the first character is a quote character.  It would seem we need
  to support two modes: escape-by-quoting and escape-by-prefix.  In
  addition, for the second mode, we'd have to specify the escape
  character (presumably defaulting to a backslash character).

- Should there be a "fully quoted" mode for writing?  What about
  "fully quoted except for numeric values"?

- What about end-of-line?  If I generate a CSV file on a Unix system,
  will Excel properly recognize the LF-only line terminators?

- What about conversion to other file formats?  Is the list-of-lists
  output from the csvreader sufficient to feed into other writers?

- What about an option to generate list-of-dict output from the reader
  and accept list-of-dicts by the writer?  This makes manipulating
  individual rows easier since each one is independent, but you lose
  field order when writing and have to tell the writer object the
  order the fields should appear in the file.

- Are quote characters and delimiters limited to single characters?  I
  had a client not that long ago who wrote their own flat file format
  with a delimiter of ":::".

- How should rows of different lengths be handled?  The options seem
  to be:

  * raise an exception when a row is encountered whose length differs
    from the previous row

  * silently return short rows

  * allow the caller to specify the desired row length and what to do
    when rows of a different length are encountered: ignore, truncate,
    pad, raise exception, etc.
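
The third option above could be sketched as a small normalising step
applied to each row the reader produces.  Everything here is
hypothetical: the function name, the policy names and the fill value
are illustrative only and are not proposed parameters::

    def normalize_row(row, length, policy="raise", fill=""):
        """Make row conform to length according to policy."""
        row = list(row)
        if len(row) == length or policy == "ignore":
            return row
        if policy == "truncate":
            # Rows shorter than length are returned unchanged.
            return row[:length]
        if policy == "pad":
            # Rows longer than length are returned unchanged.
            return row + [fill] * (length - len(row))
        if policy == "raise":
            raise ValueError("expected %d fields, got %d"
                             % (length, len(row)))
        raise ValueError("unknown policy: %r" % (policy,))


    print(normalize_row(["a", "b", "c"], 2, policy="truncate"))
    print(normalize_row(["a"], 3, policy="pad"))
    print(normalize_row(["a"], 3, policy="ignore"))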


References
==========

.. [1] csv module, Object Craft
   (http://www.object-craft.com.au/projects/csv)

.. [2] Python-DSV module, Wells
   (http://sourceforge.net/projects/python-dsv/)

.. [3] ASV module, Tratt
   (http://tratt.net/laurie/python/asv/)

There are many references to other CSV-related projects on the Web.  A
few are included here.


Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   End: