PEP: 358
Title: The "bytes" Object
Version: $Revision$
Last-Modified: $Date$
Author: Neil Schemenauer <nas@arctrix.com>
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 15-Feb-2006
Python-Version: 2.5
Post-History:


Abstract

    This PEP outlines the introduction of a raw bytes sequence object.
    Adding the bytes object is one step in the transition to Unicode
    based str objects.


Motivation

    Python's current string objects are overloaded. They serve to hold
    both sequences of characters and sequences of bytes. This
    overloading of purpose leads to confusion and bugs. In future
    versions of Python, string objects will be used for holding
    character data. The bytes object will fulfil the role of a byte
    container. Eventually the unicode built-in will be renamed to str
    and the str object will be removed.


Specification

    A bytes object stores a mutable sequence of integers that are in the
    range 0 to 255.  Unlike string objects, indexing a bytes object
    returns an integer.  Assigning an element using an object that is not
    an integer causes a TypeError exception.  Assigning an element to a
    value outside the range 0 to 255 causes a ValueError exception.  The
    __len__ method of bytes returns the number of integers stored in the
    sequence (i.e. the number of bytes).
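
    The element semantics described above can be sketched as follows.
    This is an illustration only: it assumes Python 3's built-in
    bytearray as a stand-in for the proposed object, since bytearray
    happens to follow the same indexing and assignment rules.  The
    variable names are hypothetical.

```python
# Sketch only: Python 3's built-in bytearray stands in for the proposed
# mutable bytes object (an assumption, not this PEP's implementation).
data = bytearray([10, 20, 30])

first = data[0]          # indexing yields an int, not a length-1 string
length = len(data)       # number of bytes stored

data[1] = 255            # in-range integers are accepted

try:
    data[0] = "a"        # not an integer
except TypeError as exc:
    type_error = exc

try:
    data[0] = 256        # outside the range 0 to 255
except ValueError as exc:
    value_error = exc
```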

    The constructor of the bytes object has the following signature:

        bytes([initialiser[, encoding]])

    If no arguments are provided then an object containing zero elements
    is created and returned.  The initialiser argument can be a string
    or a sequence of integers.  The pseudo-code for the constructor is:

        def bytes(initialiser=[], encoding=None):
            if isinstance(initialiser, basestring):
                if isinstance(initialiser, unicode):
                    if encoding is None:
                        encoding = sys.getdefaultencoding()
                    initialiser = initialiser.encode(encoding)
                initialiser = [ord(c) for c in initialiser]
            elif encoding is not None:
                raise TypeError("explicit encoding invalid for non-string "
                                "initialiser")
            create bytes object and fill with integers from initialiser
            return bytes object
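
    For readers who want something executable, the pseudo-code can be
    approximated in Python 3.  The sketch below rests on several
    assumptions that are not part of this PEP: the helper name
    make_bytes is hypothetical, Python 3's str plays the role of
    unicode, Python 3's bytes plays the role of the 8-bit str, and
    bytearray plays the role of the proposed object.  ASCII is
    hard-coded as the default encoding in place of
    sys.getdefaultencoding().

```python
def make_bytes(initialiser=(), encoding=None):
    """Hypothetical helper mirroring the constructor pseudo-code.

    Assumptions: Python 3 `str` plays the role of `unicode`, Python 3
    `bytes` plays the role of the old 8-bit `str`, and `bytearray` plays
    the role of the proposed mutable bytes object.
    """
    if isinstance(initialiser, str):
        # Character data must be encoded; ASCII is assumed as the default,
        # matching the system default encoding discussed in this PEP.
        if encoding is None:
            encoding = "ascii"
        return bytearray(initialiser.encode(encoding))
    if isinstance(initialiser, bytes):
        # An 8-bit string is copied as-is; any encoding is ignored.
        return bytearray(initialiser)
    if encoding is not None:
        raise TypeError("explicit encoding invalid for non-string "
                        "initialiser")
    # Any other iterable of integers in the range 0 to 255.
    return bytearray(initialiser)
```

    Calling make_bytes() with no arguments yields an empty object, and
    make_bytes([1], "ascii") raises TypeError, as in the pseudo-code.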

    The __repr__ method returns a string that can be evaluated to
    generate a new bytes object containing the same sequence of
    integers.  The sequence is represented by a list of ints.  For
    example:

        >>> repr(bytes([10, 20, 30]))
        'bytes([10, 20, 30])'
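
    The round-trip property of __repr__ can be demonstrated with a toy
    class.  ByteSeq is a hypothetical name invented for this sketch; it
    wraps a Python 3 bytearray and is not the proposed implementation.

```python
class ByteSeq:
    """Hypothetical stand-in for the proposed type; not a real API."""

    def __init__(self, initialiser=()):
        self._data = bytearray(initialiser)

    def __repr__(self):
        # Spell the contents as a list of ints so the text can be evaluated.
        return "ByteSeq(%r)" % (list(self._data),)

    def __eq__(self, other):
        return isinstance(other, ByteSeq) and self._data == other._data


original = ByteSeq([10, 20, 30])
text = repr(original)
rebuilt = eval(text)  # acceptable here only because the text is our own repr
```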

    The object has a decode method equivalent to the decode method of
    the str object.  The object has a classmethod fromhex that takes a
    string of characters from the set [0-9a-fA-F ] and returns a bytes
    object (similar to binascii.unhexlify).  For example:

        >>> bytes.fromhex('5c5350ff')
        bytes([92, 83, 80, 255])
        >>> bytes.fromhex('5c 53 50 ff')
        bytes([92, 83, 80, 255])

    The object has a hex method that does the reverse conversion
    (similar to binascii.hexlify):

        >>> bytes([92, 83, 80, 255]).hex()
        '5c5350ff'
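
    The relationship to binascii can be checked directly.  The sketch
    below assumes a recent Python 3, whose built-in bytes type provides
    fromhex and hex methods with the behaviour described here; it is
    used as a stand-in for the proposed object.

```python
import binascii

hex_text = "5c5350ff"

# Built-in conversions; whitespace between byte pairs is accepted.
from_compact = bytes.fromhex(hex_text)
from_spaced = bytes.fromhex("5c 53 50 ff")

# binascii offers the comparable conversions mentioned above.
via_binascii = binascii.unhexlify(hex_text)

round_trip = from_compact.hex()
hexlified = binascii.hexlify(from_compact).decode("ascii")
```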

    The bytes object has methods similar to the list object:

        __add__
        __contains__
        __delitem__
        __delslice__
        __eq__
        __ge__
        __getitem__
        __getslice__
        __gt__
        __hash__
        __iadd__
        __imul__
        __iter__
        __le__
        __len__
        __lt__
        __mul__
        __ne__
        __reduce__
        __reduce_ex__
        __repr__
        __rmul__
        __setitem__
        __setslice__
        append
        count
        extend
        index
        insert
        pop
        remove
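
    A few of the list-like methods are exercised below.  The sketch
    assumes Python 3's bytearray as a stand-in for the proposed object;
    it covers only a subset of the methods listed and says nothing
    about hashing or pickling.

```python
buf = bytearray([1, 2, 3])

buf.append(4)             # [1, 2, 3, 4]
buf.extend([2, 5])        # [1, 2, 3, 4, 2, 5]
buf.insert(0, 0)          # [0, 1, 2, 3, 4, 2, 5]

twos = buf.count(2)       # occurrences of the value 2
where = buf.index(3)      # position of the first 3

last = buf.pop()          # removes and returns the final element
buf.remove(2)             # removes the first 2 only

doubled = buf * 2         # __mul__ repeats the sequence
has_four = 4 in buf       # __contains__ tests for an integer value
del buf[0]                # __delitem__ drops the leading element
```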


Out of scope issues

    * If we provide a literal syntax for bytes then it should look
      distinctly different from the syntax for literal strings.  Also, a
      new type, even built-in, is much less drastic than a new literal
      (which requires lexer and parser support in addition to everything
      else).  Since there appears to be no immediate need for a literal
      representation, designing and implementing one is out of the scope
      of this PEP.

    * Python 3k will have a much different I/O subsystem.  Deciding how
      that I/O subsystem will work and interact with the bytes object is
      out of the scope of this PEP.

    * It has been suggested that a special method named __bytes__ be
      added to the language to allow objects to be converted into byte
      arrays.  This decision is out of scope.


Unresolved issues

    * Perhaps the bytes object should be implemented as an extension
      module until we are more sure of the design (similar to how the
      set object was prototyped).

    * Should the bytes object implement the buffer interface?  Probably,
      but we need to look into the implications of that (e.g. regex
      operations on byte arrays).

    * Should the object implement __reversed__ and reverse?  Should it
      implement sort?

    * Need to clarify what some of the methods do.  How are comparisons
      done?  Hashing?  Pickling and marshalling?


Questions and answers

    Q: Why have the optional encoding argument when the encode method of
       Unicode objects does the same thing?

    A: In the current version of Python, the encode method returns a str
       object and we cannot change that without breaking code.  The
       construct bytes(s.encode(...)) is expensive because it has to
       copy the byte sequence multiple times.  Also, Python generally
       provides two ways of converting an object of type A into an
       object of type B: ask an A instance to convert itself to a B, or
       ask the type B to create a new instance from an A. Depending on
       what A and B are, both APIs make sense; sometimes reasons of
       decoupling require that A can't know about B, in which case you
       have to use the latter approach; sometimes B can't know about A,
       in which case you have to use the former.


    Q: Why does bytes ignore the encoding argument if the initialiser is
       a str?

    A: There is no sane meaning that the encoding can have in that case.
       str objects *are* byte arrays and they know nothing about the
       encoding of character data they contain.  We need to assume that
       the programmer has provided a str object that already uses the
       desired encoding. If you need something other than a pure copy of
       the bytes then you need to first decode the string.  For example:

           bytes(s.decode(encoding1), encoding2)
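
       The difference between a pure copy and a transcoded copy can
       be sketched in Python 3 terms.  The names, sample text and
       encodings are illustrative assumptions: Python 3's bytes
       stands in for the 8-bit str, and bytearray for the proposed
       object.

```python
# Python 3 names are used as stand-ins: `bytes` for the old 8-bit str,
# `str` for unicode.  The sample text and encodings are illustrative.
latin1_data = "caf\u00e9".encode("latin-1")     # four bytes, last is 0xE9

# A plain copy keeps the Latin-1 byte values untouched.
plain_copy = bytearray(latin1_data)

# Re-encoding needs an explicit decode first, then an encode.
utf8_copy = bytearray(latin1_data.decode("latin-1").encode("utf-8"))
```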


    Q: Why not have the encoding argument default to Latin-1 (or some
       other encoding that covers the entire byte range) rather than
       ASCII?

    A: The system default encoding for Python is ASCII.  It seems least
       confusing to use that default.  Also, in Py3k, using Latin-1 as
       the default might not be what users expect.  For example, they
       might prefer a Unicode encoding.  Any default will not always
       work as expected.  At least ASCII will complain loudly if you try
       to encode non-ASCII data.
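
       The contrast between the two defaults can be seen in a short
       Python 3 sketch.  The sample text is an assumption chosen for
       illustration; str.encode stands in for the encoding step
       performed by the constructor.

```python
text = "na\u00efve"   # contains U+00EF, which is outside ASCII

try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True   # ASCII refuses the character outright

# Latin-1 accepts it silently, whether or not that was the intent.
latin1_bytes = text.encode("latin-1")
```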


Copyright

    This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   End: