PEP: 305
Title: CSV File API
Version: $Revision$
Last-Modified: $Date$
Author: Kevin Altis, Dave Cole, Andrew McNamara, Skip Montanaro, Cliff Wells
Discussions-To:
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 26-Jan-2003
Post-History: 31-Jan-2003


Abstract
========

The Comma Separated Values (CSV) file format is the most common import
and export format for spreadsheets and databases.  Although many CSV
files are simple to parse, the format is not formally defined by a
stable specification and is subtle enough that parsing lines of a CSV
file with something like ``line.split(",")`` is bound to fail, for
example on a quoted field that itself contains a comma.  This PEP
defines an API for reading and writing CSV files which should make it
possible for programmers to select a CSV module which meets their
requirements.  It is accompanied by a corresponding module which
implements the API.


To Do (Notes for the Interested and Ambitious)
==============================================

- Need to better explain the advantages of a purpose-built csv module
  over the simple ``",".join()`` and ``"".split()`` approach.

- Need to complete initial list of formatting parameters and settle on
  names.

- Better motivation for the choice of passing a file object to the
  constructors.  See
  http://manatee.mojam.com/pipermail/csv/2003-January/000179.html


Existing Modules
================

Three widely available modules enable programmers to read and write
CSV files:

- Object Craft's CSV module [2]_

- Cliff Wells' Python-DSV module [3]_

- Laurence Tratt's ASV module [4]_

Each has a different API, making it somewhat difficult for programmers
to switch between them.  A bigger problem may be that they interpret
some of the CSV corner cases differently, so even after surmounting
the differences in the module APIs, the programmer still has to deal
with semantic differences between the packages.

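As an illustration of the Abstract's remark about ``line.split(",")``,
the following minimal sketch compares naive splitting with a
purpose-built parser.  It assumes the ``csv`` module shipped in current
Python standard libraries, whose names need not match the draft API in
this PEP, and the sample record is invented for the example.

```python
import csv
import io

# One invented record: the second field contains a comma and doubled quotes.
line = '1,"Smith, ""Bob""",42\r\n'

# Naive approach: splits inside the quoted field and leaves the quotes in place.
naive = line.rstrip("\r\n").split(",")

# Standard-library csv.reader (an assumption, not the draft API of this PEP):
# honours the quoting and un-doubles the embedded quote characters.
parsed = next(csv.reader(io.StringIO(line, newline="")))

print(naive)   # four fragments, quoting characters still embedded
print(parsed)  # three fields
```

The naive result has four pieces for a three-field record, which is the
kind of corner case a shared API is meant to take off the programmer's
hands.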

Rationale
=========

By defining common APIs for reading and writing CSV files, we make it
easier for programmers to choose an appropriate module to suit their
needs, and make it easier to switch between modules if their needs
change.  This PEP also forms a set of requirements for creation of a
module which will hopefully be incorporated into the Python
distribution.

CSV formats are not well-defined and different implementations have a
number of subtle corner cases.  It has been suggested that the "V" in
the acronym stands for "Vague" instead of "Values".  Different
delimiters and quoting characters are just the start.  Some programs
generate whitespace after the delimiter.  Others quote embedded
quoting characters by doubling them or prefixing them with an escape
character.  The list of weird ways to do things seems nearly endless.

Unfortunately, all this variability and subtlety means it is difficult
for programmers to reliably parse CSV files from many sources or
generate CSV files designed to be fed to specific external programs
without deep knowledge of those sources and programs.  This PEP and
the software which accompanies it attempt to make the process less
fragile.


Module Interface
================

The module supports two basic APIs, one for reading and one for
writing.  The basic reading interface is::

    obj = reader(fileobj [, dialect='excel']
                 [optional keyword args])

A reader object is an iterable.  Its constructor takes a file-like
object opened for reading as the sole required parameter.  The
optional dialect parameter is discussed below.  The constructor also
accepts several optional keyword arguments which define specific
format settings for the parser (see the section "Formatting
Parameters").

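The whitespace case mentioned in the Rationale, and the idea of passing
a format setting as a keyword argument, can be sketched as follows.
This is only an illustration under assumptions: it uses the ``csv``
module of current Python standard libraries rather than the draft
module of this PEP, and the sample data is invented.

```python
import csv
import io

# Invented sample: the producing program put a space after each delimiter.
text = 'name, city\r\n"Ann", "Oslo"\r\n'

# Default settings: the space is kept as part of the following field.
rows_default = list(csv.reader(io.StringIO(text, newline="")))

# Keyword override: whitespace right after a delimiter is skipped.
rows_skipped = list(
    csv.reader(io.StringIO(text, newline=""), skipinitialspace=True)
)

for row in rows_skipped:  # a reader is consumed by iterating over it
    print(row)
```

With the default settings the second field of each row keeps its
leading space; with ``skipinitialspace=True`` the fields come back
clean.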
Readers are typically used as follows::

    csvreader = csv.reader(file("some.csv"))
    for row in csvreader:
        process(row)

The writing interface is similar::

    obj = writer(fileobj [, dialect='excel'] [, fieldnames=seq]
                 [optional keyword args])

A writer object is a wrapper around a file-like object opened for
writing.  It accepts the same optional keyword parameters as the
reader constructor.  In addition, it accepts an optional fieldnames
argument.  This is a sequence that defines the order of fields in the
output file.  It allows the write() method to accept mapping objects
as well as sequence objects.

Writers are typically used as follows::

    csvwriter = csv.writer(file("some.csv", "w"))
    for row in someiterable:
        csvwriter.write(row)

To generate a set of field names as the first row of the CSV file, the
programmer must explicitly write it, e.g.::

    csvwriter = csv.writer(file("some.csv", "w"), fieldnames=names)
    csvwriter.write(names)
    for row in someiterable:
        csvwriter.write(row)

or arrange for it to be the first row in the iterable being written.


Dialects
--------

Readers and writers support a dialect argument which is a convenient
handle for a group of lower-level parameters.

When dialect is a string, it identifies one of the dialects known to
the module; otherwise it is processed as a dialect class as described
below.

Dialects will generally be named after applications or organizations
which define specific sets of format constraints.  The initial dialect
is "excel", which describes the format constraints of Excel 97 and
Excel 2000 regarding CSV input and output.  Another possible dialect
(used here only as an example) might be "gnumeric".

Dialects are implemented as attribute-only classes to enable users to
construct variant dialects by subclassing.

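The two ways of naming a dialect described above, by string handle or
by class, can be tried out with the following minimal sketch.  It
assumes the ``csv`` module of current Python standard libraries, where
registration is spelled ``register_dialect()`` and rows are written
with ``writerow()``; those names are not the ones proposed in this
draft.  The dialect name ``"excel-tsv-demo"`` and the class
``ExcelTSV`` are hypothetical.

```python
import csv
import io


class ExcelTSV(csv.excel):
    """Hypothetical variant of the 'excel' dialect using TAB as delimiter."""
    delimiter = "\t"


# Make the variant available under a string handle (hypothetical name).
csv.register_dialect("excel-tsv-demo", ExcelTSV)

# Select the dialect by name; passing the ExcelTSV class would also work.
buf = io.StringIO(newline="")
writer = csv.writer(buf, dialect="excel-tsv-demo")
writer.writerow(["a", "b c", 'say "hi"'])

output = buf.getvalue()
print(repr(output))
```

Only the delimiter is overridden; quoting and line termination are
inherited from the parent dialect, which is the point of implementing
dialects as classes.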
The "excel" dialect is implemented as follows::

    class excel:
        quotechar = '"'
        delimiter = ','
        escapechar = None
        skipinitialspace = False
        lineterminator = '\r\n'
        quoting = QUOTE_MINIMAL

An excel tab separated dialect can then be defined in user code as
follows::

    class exceltsv(csv.excel):
        delimiter = '\t'

Three functions are defined in the API to set, get and list dialects::

    set_dialect(name, dialect)
    dialect = get_dialect(name)
    known_dialects = list_dialects()

The dialect parameter is a class or instance whose attributes are the
formatting parameters defined in the next section.  The
list_dialects() function returns all the registered dialect names as
given in previous set_dialect() calls (both predefined and
user-defined).


Formatting Parameters
---------------------

Both the reader and writer constructors take several specific
formatting parameters, specified as keyword parameters.  The
parameters are also the keys for the input and output mapping objects
for the set_dialect() and get_dialect() module functions.

- ``quotechar`` specifies a one-character string to use as the quoting
  character.  It defaults to '"'.

- ``delimiter`` specifies a one-character string to use as the field
  separator.  It defaults to ','.

- ``escapechar`` specifies a one-character string used to escape the
  delimiter when quotechar is set to None.

- ``skipinitialspace`` specifies how to interpret whitespace which
  immediately follows a delimiter.  It defaults to False, which means
  that whitespace immediately following a delimiter is part of the
  following field.

- ``lineterminator`` specifies the character sequence which should
  terminate rows.

- ``quoting`` controls when quotes should be generated by the writer.
  It can take on any of the following module constants::

    csv.QUOTE_MINIMAL means only when required, for example, when a
        field contains either the quotechar or the delimiter

    csv.QUOTE_ALL means that quotes are always placed around fields.

    csv.QUOTE_NONNUMERIC means that quotes are always placed around
        fields which contain characters other than [+-0-9.].

- ``doublequote`` (tbd)

- are there more to come?

When processing a dialect setting and one or more of the other
optional parameters, the dialect parameter is processed first, then
the others are processed.  This makes it easy to choose a dialect,
then override one or more of the settings without defining a new
dialect class.  For example, if a CSV file was generated by Excel 2000
using single quotes as the quote character and TAB as the delimiter,
you could create a reader like::

    csvreader = csv.reader(file("some.csv"), dialect="excel",
                           quotechar="'", delimiter='\t')

Other details of how Excel generates CSV files would be handled
automatically.


Implementation
==============

There is a sample implementation available.  [1]_ The goal is for it
to efficiently implement the API described in the PEP.  It is heavily
based on the Object Craft csv module.  [2]_


Testing
=======

The sample implementation [1]_ includes a set of test cases.


Issues
======

- Should a parameter control how consecutive delimiters are
  interpreted?  Our thought is "no".  Consecutive delimiters should
  always denote an empty field.

- What about Unicode?  Is it sufficient to pass a file object obtained
  from codecs.open()?  For example::

    csvreader = csv.reader(codecs.open("some.csv", "r", "cp1252"))

    csvwriter = csv.writer(codecs.open("some.csv", "w", "utf-8"))

  In the first example, text would be assumed to be encoded as cp1252.
  Should the system be aggressive in converting to Unicode or should
  Unicode strings only be returned if necessary?

  In the second example, the file will take care of automatically
  encoding Unicode strings as utf-8 before writing to disk.

- What about alternate escape conventions?  When Excel exports a file,
  it appears only the field delimiter needs to be escaped.  It
  accomplishes this by quoting the entire field, then doubling any
  quote characters which appear in the field.  It also quotes a field
  if the first character is a quote character.  It would seem we need
  to support two modes: escape-by-quoting and escape-by-prefix.  In
  addition, for the second mode, we'd have to specify the escape
  character (presumably defaulting to a backslash character).

- Should there be a "fully quoted" mode for writing?  What about
  "fully quoted except for numeric values"?

- What about end-of-line?  If I generate a CSV file on a Unix system,
  will Excel properly recognize the LF-only line terminators?

- What about conversion to other file formats?  Is the list-of-lists
  output from the csvreader sufficient to feed into other writers?

- What about an option to generate list-of-dict output from the reader
  and accept list-of-dicts by the writer?  This makes manipulating
  individual rows easier since each one is independent, but you lose
  field order when writing and have to tell the writer object the
  order the fields should appear in the file.

- Are quote characters and delimiters limited to single characters?  I
  had a client not that long ago who wrote their own flat file format
  with a delimiter of ":::".

- How should rows of different lengths be handled?  The options seem
  to be:

  * raise an exception when a row is encountered whose length differs
    from the previous row

  * silently return short rows

  * allow the caller to specify the desired row length and what to do
    when rows of a different length are encountered: ignore, truncate,
    pad, raise exception, etc.


References
==========

.. [1] csv module, Python Sandbox
   (http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/csv/)

.. [2] csv module, Object Craft
   (http://www.object-craft.com.au/projects/csv)

.. [3] Python-DSV module, Wells
   (http://sourceforge.net/projects/python-dsv/)

.. [4] ASV module, Tratt
   (http://tratt.net/laurie/python/asv/)

There are many references to other CSV-related projects on the Web.  A
few are included here.


Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   End: