a number of changes to keep more-or-less up-to-date.

This commit is contained in:
Skip Montanaro 2003-02-10 04:03:49 +00:00
parent 1dc70cf85e
commit 67273248cc
1 changed file with 233 additions and 165 deletions


@@ -18,31 +18,77 @@

The Comma Separated Values (CSV) file format is the most common import
and export format for spreadsheets and databases.  Although many CSV
files are simple to parse, the format is not formally defined by a
stable specification and is subtle enough that parsing lines of a CSV
file with something like ``line.split(",")`` is eventually bound to
fail.  This PEP defines an API for reading and writing CSV files.  It
is accompanied by a corresponding module which implements the API.
To Do (Notes for the Interested and Ambitious)
==============================================

- Need to better explain the advantages of a purpose-built csv module
  over the simple ",".join() and [].split() approach.

- Need to complete the initial list of formatting parameters and
  settle on names.

- Better motivation for the choice of passing a file object to the
  constructors.  See
  http://manatee.mojam.com/pipermail/csv/2003-January/000179.html

- Unicode.  Ugh.
Application Domain
==================

This PEP is about doing one thing well: parsing tabular data which may
use a variety of field separators, quoting characters, quote escape
mechanisms and line endings.  The authors intend the proposed module
to solve this one parsing problem efficiently.  The authors do not
intend to address any of these related topics:

- data interpretation (is a field containing the string "10" supposed
  to be a string, a float or an int?  is it a number in base 10, base
  16 or base 2?  is a number in quotes a number or a string?)

- locale-specific data representation (should the number 1.23 be
  written as "1.23" or "1,23" or "1 23"?) -- this may eventually be
  addressed.

- fixed width tabular data - can already be parsed reliably.
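As a minimal illustration of the first point, here is a sketch using the
csv module as it exists in modern Python (an assumption beyond the API
text above): every field comes back as a string, and any interpretation
of "10" as an int, float, or string is left entirely to the caller.

```python
import csv
import io

# Parse a single row from an in-memory "file".  Quoting is consumed
# during parsing, but no type conversion is ever attempted.
rows = list(csv.reader(io.StringIO('10,"10",ten\r\n')))
print(rows)  # [['10', '10', 'ten']] -- all fields are strings
```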
Rationale
=========

Often, CSV files are formatted simply enough that you can get by
reading them line-by-line and splitting on the commas which delimit
the fields.  This is especially true if all the data being read is
numeric.  This approach may work for a while, then come back to bite
you when somebody puts something unexpected in the data, like a
comma.  As you dig into the problem you may eventually conclude that
you can solve it using regular expressions.  This will work for a
while, then break mysteriously one day.  The problem grows, so you
dig deeper and eventually realize that you need a purpose-built
parser for the format.

CSV formats are not well-defined and different implementations have a
number of subtle corner cases.  It has been suggested that the "V" in
the acronym stands for "Vague" instead of "Values".  Different
delimiters and quoting characters are just the start.  Some programs
generate whitespace after each delimiter which is not part of the
following field.  Others quote embedded quoting characters by doubling
them, others by prefixing them with an escape character.  The list of
weird ways to do things can seem endless.

All this variability means it is difficult for programmers to reliably
parse CSV files from many sources or generate CSV files designed to be
fed to specific external programs without a thorough understanding of
those sources and programs.  This PEP and the software which
accompanies it attempt to make the process less fragile.
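The failure mode described above can be sketched concretely (using the
modern csv module for contrast, which is an assumption beyond this
text): a quoted field containing a comma defeats the naive split, while
a real CSV parser handles it.

```python
import csv
import io

line = '1,"Montanaro, Skip",Python'

# The naive approach splits inside the quoted field -- wrong result:
print(line.split(","))  # ['1', '"Montanaro', ' Skip"', 'Python']

# A purpose-built parser respects the quoting:
row = next(csv.reader(io.StringIO(line)))
print(row)              # ['1', 'Montanaro, Skip', 'Python']
```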
Existing Modules
================

This problem has been tackled before.  At least three modules
currently available in the Python community enable programmers to read
and write CSV files:

- Object Craft's CSV module [2]_

@@ -53,73 +99,65 @@

Each has a different API, making it somewhat difficult for programmers
to switch between them.  More of a problem may be that they interpret
some of the CSV corner cases differently, so even after surmounting
the differences between the module APIs, the programmer also has to
deal with semantic differences between the packages.
Module Interface
================

This PEP supports three basic APIs, one to read and parse CSV files,
one to write them, and one to identify different CSV dialects to the
readers and writers.

Reading CSV Files
-----------------

CSV readers are created with the reader factory function::

    obj = reader(iterable [, dialect='excel']
                 [optional keyword args])

A reader object is an iterator which takes an iterable object
returning lines as the sole required parameter.  If it supports a
binary mode (file objects do), the iterable argument to the reader
function must have been opened in binary mode.  This gives the reader
object full control over the interpretation of the file's contents.
The optional dialect parameter is discussed below.  The reader
function also accepts several optional keyword arguments which define
specific format settings for the parser (see the section "Formatting
Parameters").  Readers are typically used as follows::

    csvreader = csv.reader(file("some.csv"))
    for row in csvreader:
        process(row)

Each row returned by a reader object is a list of strings or Unicode
objects.

When both a dialect parameter and individual formatting parameters are
passed to the constructor, first the dialect is queried for formatting
parameters, then individual formatting parameters are examined.
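A runnable sketch of the reading pattern, adapted to the csv module as
it exists in modern Python (an assumption: there, files are opened in
text mode with ``newline=''``, and any iterable of lines works, so an
in-memory buffer can stand in for ``file("some.csv")``):

```python
import csv
import io

# An in-memory buffer standing in for an open CSV file.
data = io.StringIO('a,b,c\r\n1,"2,5",3\r\n')

csvreader = csv.reader(data)   # dialect defaults to "excel"
for row in csvreader:
    print(row)                 # each row is a list of strings
```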
Writing CSV Files
-----------------

Creating writers is similar::

    obj = writer(fileobj [, dialect='excel'],
                 [optional keyword args])

A writer object is a wrapper around a file-like object opened for
writing in binary mode (if such a distinction is made).  It accepts
the same optional keyword parameters as the reader constructor.

Writers are typically used as follows::

    csvwriter = csv.writer(file("some.csv", "w"))
    for row in someiterable:
        csvwriter.writerow(row)

To generate a set of field names as the first row of the CSV file, the
programmer must explicitly write it, e.g.::

@@ -132,111 +170,150 @@

or arrange for it to be the first row in the iterable being written.
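The writing pattern, including an explicitly written header row, can be
sketched as follows (against the modern csv module, an assumption
beyond this text; an in-memory buffer stands in for the open file):

```python
import csv
import io

fieldnames = ["name", "num"]   # header row, written explicitly
buf = io.StringIO()            # stands in for an open output file
csvwriter = csv.writer(buf)
csvwriter.writerow(fieldnames)
for row in [("spam", 1), ("eggs", 2)]:
    csvwriter.writerow(row)

# The default "excel" dialect terminates rows with '\r\n'.
print(repr(buf.getvalue()))    # 'name,num\r\nspam,1\r\neggs,2\r\n'
```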
Managing Different Dialects
---------------------------

Because CSV is a somewhat ill-defined format, there are plenty of ways
one CSV file can differ from another, yet contain exactly the same
data.  Many tools which can import or export tabular data allow the
user to indicate the field delimiter, quote character, line
terminator, and other characteristics of the file.  These can be
fairly easily determined, but are still mildly annoying to figure out,
and make for fairly long function calls when specified individually.

To try and minimize the difficulty of figuring out and specifying a
bunch of formatting parameters, reader and writer objects support a
dialect argument which is just a convenient handle on a group of these
lower level parameters.  When a dialect is given as a string it
identifies one of the dialects known to the module via its
registration functions, otherwise it must be an instance of the
Dialect class as described below.

Dialects will generally be named after applications or organizations
which define specific sets of format constraints.  Two dialects are
defined in the module as of this writing, "excel", which describes the
default format constraints for CSV file export by Excel 97 and Excel
2000, and "excel-tab", which is the same as "excel" but specifies an
ASCII TAB character as the field delimiter.

Dialects are implemented as attribute only classes to enable users to
construct variant dialects by subclassing.  The "excel" dialect is a
subclass of Dialect and is defined as follows::

    class Dialect:
        # placeholders
        delimiter = None
        quotechar = None
        escapechar = None
        doublequote = None
        skipinitialspace = None
        lineterminator = None
        quoting = None

    class excel(Dialect):
        delimiter = ','
        quotechar = '"'
        escapechar = None
        doublequote = True
        skipinitialspace = False
        lineterminator = '\r\n'
        quoting = QUOTE_MINIMAL

The "excel-tab" dialect is defined as::

    class exceltsv(excel):
        delimiter = '\t'

(For a description of the individual formatting parameters see the
section "Formatting Parameters".)

To enable string references to specific dialects, the module defines
several functions::

    register_dialect(name, dialect)
    dialect = get_dialect(name)
    names = list_dialects()
    unregister_dialect(name)

``get_dialect()`` returns the dialect instance associated with the
given name.  ``list_dialects()`` returns a list of all registered
dialect names.  ``register_dialect()`` associates a string name with
a dialect class.  ``unregister_dialect()`` deletes a name/dialect
association.
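The registration machinery can be exercised end-to-end; this sketch
uses the modern csv module (an assumption beyond this text), where all
four functions exist, and defines a hypothetical pipe-delimited variant
dialect by subclassing:

```python
import csv

# A variant dialect built by subclassing, as described above.
# The name "pipes" is purely illustrative.
class pipes(csv.excel):
    delimiter = '|'

csv.register_dialect("pipes", pipes)
print("pipes" in csv.list_dialects())  # True
print(csv.get_dialect("pipes").delimiter)  # |
csv.unregister_dialect("pipes")
print("pipes" in csv.list_dialects())  # False
```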
Formatting Parameters
---------------------

In addition to the dialect argument, both the reader and writer
constructors take several specific formatting parameters, specified as
keyword parameters.  The formatting parameters understood are:

- ``quotechar`` specifies a one-character string to use as the
  quoting character.  It defaults to '"'.  Setting this to None
  has the same effect as setting quoting to csv.QUOTE_NONE.

- ``delimiter`` specifies a one-character string to use as the
  field separator.  It defaults to ','.

- ``escapechar`` specifies a one-character string used to escape
  the delimiter when quotechar is set to None.

- ``skipinitialspace`` specifies how to interpret whitespace which
  immediately follows a delimiter.  It defaults to False, which
  means that whitespace immediately following a delimiter is part
  of the following field.

- ``lineterminator`` specifies the character sequence which should
  terminate rows.

- ``quoting`` controls when quotes should be generated by the
  writer.  It can take on any of the following module constants:

  * csv.QUOTE_MINIMAL means only when required, for example,
    when a field contains either the quotechar or the delimiter

  * csv.QUOTE_ALL means that quotes are always placed around
    fields.

  * csv.QUOTE_NONNUMERIC means that quotes are always placed
    around nonnumeric fields.

  * csv.QUOTE_NONE means that quotes are never placed around
    fields.

- ``doublequote`` controls the handling of quotes inside fields.
  When True two consecutive quotes are interpreted as one during
  read, and when writing, each quote is written as two quotes.

When processing a dialect setting and one or more of the other
optional parameters, the dialect parameter is processed before the
individual formatting parameters.  This makes it easy to choose a
dialect, then override one or more of the settings without defining a
new dialect class.  For example, if a CSV file was generated by Excel
2000 using single quotes as the quote character and a colon as the
delimiter, you could create a reader like::

    csvreader = csv.reader(file("some.csv"), dialect="excel",
                           quotechar="'", delimiter=':')

Other details of how Excel generates CSV files would be handled
automatically because of the reference to the "excel" dialect.
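The dialect-plus-overrides example above runs essentially unchanged
against the modern csv module (an assumption; an in-memory buffer
stands in for the file):

```python
import csv
import io

# A colon-delimited, single-quoted line, as in the example above.
data = io.StringIO("'a:b':c\r\n")

csvreader = csv.reader(data, dialect="excel",
                       quotechar="'", delimiter=':')
# The colon inside the quoted field is not treated as a delimiter.
print(next(csvreader))  # ['a:b', 'c']
```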
Reader Objects
--------------

Reader objects are iterators whose next() method returns a sequence of
strings, one string per field in the row.

Writer Objects
--------------

Writer objects have two methods, writerow() and writerows().  The
former accepts an iterable (typically a list) of fields which are to
be written to the output.  The latter accepts a list of iterables and
calls writerow() for each.
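The writerow()/writerows() relationship can be shown in a few lines
(again a sketch against the modern csv module, an assumption beyond
this text):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["a", "b"])                  # one row at a time
writer.writerows([["1", "2"], ["3", "4"]])   # writerow() for each
print(repr(buf.getvalue()))                  # 'a,b\r\n1,2\r\n3,4\r\n'
```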
Implementation
==============

@@ -257,63 +334,54 @@ The sample implementation [1]_ includes a set of test cases.
Issues
======

1. Should a parameter control how consecutive delimiters are
   interpreted?  Our thought is "no".  Consecutive delimiters should
   always denote an empty field.

2. What about Unicode?  Is it sufficient to pass a file object gotten
   from codecs.open()?  For example::

       csvreader = csv.reader(codecs.open("some.csv", "r", "cp1252"))
       csvwriter = csv.writer(codecs.open("some.csv", "w", "utf-8"))

   In the first example, text would be assumed to be encoded as
   cp1252.  Should the system be aggressive in converting to Unicode
   or should Unicode strings only be returned if necessary?

   In the second example, the file will take care of automatically
   encoding Unicode strings as utf-8 before writing to disk.

   Note: As of this writing, the csv module doesn't handle Unicode
   data.

3. What about alternate escape conventions?  If the dialect in use
   includes an ``escapechar`` parameter which is not None and the
   ``quoting`` parameter is set to QUOTE_NONE, delimiters appearing
   within fields will be prefixed by the escape character when writing
   and are expected to be prefixed by the escape character when
   reading.

4. Should there be a "fully quoted" mode for writing?  What about
   "fully quoted except for numeric values"?  Both are implemented
   (QUOTE_ALL and QUOTE_NONNUMERIC, respectively).

5. What about end-of-line?  If I generate a CSV file on a Unix system,
   will Excel properly recognize the LF-only line terminators?  Files
   must be opened for reading or writing as appropriate using binary
   mode.  Specify the ``lineterminator`` sequence as '\r\n'.  The
   resulting file will be written correctly.

6. What about an option to generate dicts from the reader and accept
   dicts by the writer?  See the DictReader and DictWriter classes in
   csv.py.

8. Are quote character and delimiters limited to single characters?
   For the time being, yes.

9. How should rows of different lengths be handled?  Interpretation of
   the data is the application's job.  There is no such thing as a
   "short row" or a "long row" at this level.
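The escape convention described in issue 3 can be demonstrated with the
modern csv module (an assumption beyond this text): with QUOTE_NONE and
an ``escapechar``, the writer prefixes embedded delimiters with the
escape character instead of quoting the field, and a reader configured
with the same ``escapechar`` reverses the transformation.

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_NONE, escapechar='\\')
writer.writerow(["a,b", "c"])
# The embedded comma is escaped rather than the field being quoted.
print(repr(buf.getvalue()))  # 'a\\,b,c\r\n'

row = next(csv.reader(io.StringIO(buf.getvalue()), escapechar='\\'))
print(row)                   # ['a,b', 'c']
```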
References
==========