From 1cf7aea11c5bd1854b91c8f37262eb6738966ca3 Mon Sep 17 00:00:00 2001
From: Skip Montanaro
Date: Fri, 31 Jan 2003 21:49:32 +0000
Subject: [PATCH] various cleanups expanded Rationale a tad added Post-History date (announcing it in a moment) added pointer to sandbox implementation mentioned implementation in the (massive ;-) Testing section

---
 pep-0305.txt | 123 +++++++++++++++++++++++++++++++++------------------
 1 file changed, 80 insertions(+), 43 deletions(-)

diff --git a/pep-0305.txt b/pep-0305.txt
index 963a65a11..d94dbed72 100644
--- a/pep-0305.txt
+++ b/pep-0305.txt
@@ -11,7 +11,7 @@ Status: Draft
 Type: Standards Track
 Content-Type: text/x-rst
 Created: 26-Jan-2003
-Post-History:
+Post-History: 31-Jan-2003
 
 
 Abstract
@@ -24,7 +24,8 @@ stable specification and is subtle enough that parsing lines of a CSV
 file with something like ``line.split(",")`` is bound to fail. This
 PEP defines an API for reading and writing CSV files which should make
 it possible for programmers to select a CSV module which meets their
-requirements.
+requirements. It is accompanied by a corresponding module which
+implements the API.
 
 
 To Do (Notes for the Interested and Ambitious)
@@ -46,11 +47,11 @@ Existing Modules
 Three widely available modules enable programmers to read and write
 CSV files:
 
-- Object Craft's CSV module [1]_
+- Object Craft's CSV module [2]_
 
-- Cliff Wells's Python-DSV module [2]_
+- Cliff Wells' Python-DSV module [3]_
 
-- Laurence Tratt's ASV module [3]_
+- Laurence Tratt's ASV module [4]_
 
 Each has a different API, making it somewhat difficult for programmers
 to switch between them. More of a problem may be that they interpret
@@ -69,6 +70,21 @@ change. This PEP also forms a set of requirements for creation of a
 module which will hopefully be incorporated into the Python
 distribution.
 
+CSV formats are not well-defined and different implementations have a
+number of subtle corner cases.
It has been suggested that the "V" in
+the acronym stands for "Vague" instead of "Values". Different
+delimiters and quoting characters are just the start. Some programs
+generate whitespace after the delimiter. Others quote embedded
+quoting characters by doubling them or prefixing them with an escape
+character. The list of weird ways to do things seems nearly endless.
+
+Unfortunately, all this variability and subtlety means it is difficult
+for programmers to reliably parse CSV files from many sources or
+generate CSV files designed to be fed to specific external programs
+without deep knowledge of those sources and programs. This PEP and
+the software which accompany it attempt to make the process less
+fragile.
+
 
 Module Interface
 ================
@@ -76,7 +92,8 @@ Module Interface
 The module supports two basic APIs, one for reading and one for
 writing. The basic reading interface is::
 
-    reader(fileobj [, dialect='excel2000'] [optional keyword args])
+    obj = reader(fileobj [, dialect='excel2000']
+                         [optional keyword args])
 
 A reader object is an iterable which takes a file-like object opened
 for reading as the sole required parameter. The optional dialect
@@ -91,13 +108,13 @@ as follows::
 
 The writing interface is similar::
 
-    writer(fileobj [, dialect='excel2000'], [, fieldnames=list]
-           [optional keyword args])
+    obj = writer(fileobj [, dialect='excel2000'], [, fieldnames=seq]
+                 [optional keyword args])
 
 A writer object is a wrapper around a file-like object opened for
 writing. It accepts the same optional keyword parameters as the
 reader constructor. In addition, it accepts an optional fieldnames
-argument. This is a list which defines the order of fields in the
+argument. This is a sequence that defines the order of fields in the
 output file. It allows the write() method to accept mapping objects
 as well as sequence objects.

@@ -115,6 +132,8 @@ programmer must explicitly write it, e.g.::
     for row in someiterable:
         csvwriter.write(row)
 
+or arrange for it to be the first row in the iterable being written.
+
 
 Dialects
 --------
@@ -122,21 +141,21 @@ Dialects
 Readers and writers support a dialect argument which is just a
 convenient handle on a group of lower level parameters.
 
-When dialect is a string it identifies one of the dialect which is
+When dialect is a string it identifies one of the dialects which is
 known to the module, otherwise it is processed as a dialect class as
 described below.
-
+
 Dialects will generally be named after applications or organizations
 which define specific sets of format constraints. The initial dialect
-is excel2000, which describes the format constraints of Excel 2000's
-CSV format. Another possible dialect (used here only as an example)
-might be "gnumeric".
+is "excel", which describes the format constraints of Excel 97 and
+Excel 2000 regarding CSV input and output. Another possible dialect
+(used here only as an example) might be "gnumeric".
 
-Dialects are implemented as attribute only classes to enable user to
-construct variant dialects by subclassing. The excel2000 dialect is
+Dialects are implemented as attribute only classes to enable users to
+construct variant dialects by subclassing. The "excel" dialect is
 implemented as follows::
 
-    class excel2000:
+    class excel:
         quotechar = '"'
         delimiter = ','
         escapechar = None
@@ -150,13 +169,17 @@ follows::
     class exceltsv(csv.excel2000):
         delimiter = '\t'
 
-Two functions are defined in the API to set and retrieve dialects::
+Three functions are defined in the API to set, get and list dialects::
 
     set_dialect(name, dialect)
     dialect = get_dialect(name)
+    known_dialects = list_dialects()
 
 The dialect parameter is a class or instance whose attributes are the
-formatting parameters defined in the next section.
+formatting parameters defined in the next section.
The
+list_dialects() function returns all the registered dialect names as
+given in previous set_dialect() calls (both predefined and
+user-defined).
 
 
 Formatting Parameters
@@ -167,54 +190,65 @@ formatting parameters, specified as keyword parameters. The
 parameters are also the keys for the input and output mapping objects
 for the set_dialect() and get_dialect() module functions.
 
-- quotechar specifies a one-character string to use as the quoting
+- ``quotechar`` specifies a one-character string to use as the quoting
   character. It defaults to '"'.
 
-- delimiter specifies a one-character string to use as the field
+- ``delimiter`` specifies a one-character string to use as the field
   separator. It defaults to ','.
 
-- escapechar specifies a one character string used to escape the
+- ``escapechar`` specifies a one character string used to escape the
   delimiter when quotechar is set to None.
 
-- skipinitialspace specifies how to interpret whitespace which
+- ``skipinitialspace`` specifies how to interpret whitespace which
   immediately follows a delimiter. It defaults to False, which means
-  that whitespace immediate following a delimiter is part of the
+  that whitespace immediately following a delimiter is part of the
   following field.
 
-- lineterminator specifies the character sequence which should
+- ``lineterminator`` specifies the character sequence which should
   terminate rows.
 
-- quoting controls when quotes should be generated by the
-  writer.
+- ``quoting`` controls when quotes should be generated by the
+  writer. It can take on any of the following module constants::
 
-    "minimal" means only when required, for example, when a field
-    contains either the quotechar or the delimiter
+    csv.QUOTE_MINIMAL means only when required, for example, when a
+    field contains either the quotechar or the delimiter
 
-    "always" means that quotes are always placed around fields.
+    csv.QUOTE_ALL means that quotes are always placed around fields.

-    "nonnumeric" means that quotes are always placed around fields
-    which contain characters other than [+-0-9.].
+    csv.QUOTE_NONNUMERIC means that quotes are always placed around
+    fields which contain characters other than [+-0-9.].
 
-... XXX More to come XXX ...
+- ``doublequote`` (tbd)
+
+- are there more to come?
 
 When processing a dialect setting and one or more of the other
 optional parameters, the dialect parameter is processed first, then
 the others are processed. This makes it easy to choose a dialect,
-then override one or more of the settings. For example, if a CSV file
-was generated by Excel 2000 using single quotes as the quote
-character and TAB as the delimiter, you could create a reader like::
+then override one or more of the settings without defining a new
+dialect class. For example, if a CSV file was generated by Excel 2000
+using single quotes as the quote character and TAB as the delimiter,
+you could create a reader like::
 
-    csvreader = csv.reader(file("some.csv"), dialect="excel2000",
+    csvreader = csv.reader(file("some.csv"), dialect="excel",
                            quotechar="'", delimiter='\t')
 
 Other details of how Excel generates CSV files would be handled
 automatically.
 
 
+Implementation
+==============
+
+There is a sample implementation available. [1]_ The goal is for it
+to efficiently implement the API described in the PEP. It is heavily
+based on the Object Craft csv module. [2]_
+
+
 Testing
 =======
 
-TBD.
+The sample implementation [1]_ includes a set of test cases.
 
 
 
@@ -283,13 +317,16 @@ Issues
 References
 ==========
 
-.. [1] csv module, Object Craft
-   (http://www.object-craft.com.au/projects/csv)
+.. [1] csv module, Python Sandbox
+   (http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/csv/)
 
-.. [2] Python-DSV module, Wells
-   (http://sourceforge.net/projects/python-dsv/)
+.. [2] csv module, Object Craft
+   (http://www.object-craft.com.au/projects/csv)
 
-.. [3] ASV module, Tratt
+..
[3] Python-DSV module, Wells
+   (http://sourceforge.net/projects/python-dsv/)
+
+.. [4] ASV module, Tratt
    (http://tratt.net/laurie/python/asv/)
 
 There are many references to other CSV-related projects on the Web. A