Another update, clarifying (I hope) the method signatures and mentioning
other stuff that came up over dinner.
2007-02-23 04:31:15 +00:00 · 9ca7df06ee
parent 3ed8bc79de
commit 9ca7df06ee
1 changed file with 136 additions and 107 deletions
--- a/pep-0358.txt
+++ b/pep-0358.txt
@@ -13,9 +13,16 @@ Post-History:

 Abstract

-    This PEP outlines the introduction of a raw bytes sequence object.
-    Adding the bytes object is one step in the transition to Unicode
-    based str objects.
+    This PEP outlines the introduction of a raw bytes sequence type.
+    Adding the bytes type is one step in the transition to Unicode
+    based str objects which will be introduced in Python 3.0.
+
+    The PEP describes how the bytes type should work in Python 2.6, as
+    well as how it should work in Python 3.0.  (Occasionally there are
+    differences because in Python 2.6, we have two string types, str
+    and unicode, while in Python 3.0 we will only have one string
+    type, whose name will be str but whose semantics will be like the
+    2.6 unicode type.)


 Motivation
@@ -33,39 +40,48 @@ Specification

    A bytes object stores a mutable sequence of integers that are in
    the range 0 to 255.  Unlike string objects, indexing a bytes
-    object returns an integer.  Assigning an element using a object
-    that is not an integer causes a TypeError exception.  Assigning an
-    element to a value outside the range 0 to 255 causes a ValueError
-    exception.  The .__len__() method of bytes returns the number of
-    integers stored in the sequence (i.e. the number of bytes).
+    object returns an integer.  Assigning or comparing an object that
+    is not an integer to an element causes a TypeError exception.
+    Assigning an element to a value outside the range 0 to 255 causes
+    a ValueError exception.  The .__len__() method of bytes returns
+    the number of integers stored in the sequence (i.e. the number of
+    bytes).

    The constructor of the bytes object has the following signature:

-        bytes([initialiser[, [encoding]])
+        bytes([initializer[, encoding]])

-    If no arguments are provided then an object containing zero elements
-    is created and returned.  The initialiser argument can be a string,
-    a sequence of integers, or a single integer.  The pseudo-code for the
-    constructor is:
+    If no arguments are provided then a bytes object containing zero
+    elements is created and returned.  The initializer argument can be
+    a string (in 2.6, either str or unicode), an iterable of integers,
+    or a single integer.  The pseudo-code for the constructor
+    (optimized for clear semantics, not for speed) is:

-        def bytes(initialiser=[], encoding=None):
-            if isinstance(initialiser, int): # In 2.6, (int, long)
-                initialiser = [0]*initialiser
-            elif isinstance(initialiser, basestring):
-                if isinstance(initialiser, unicode): # In 3.0, always
+        def bytes(initializer=0, encoding=None):
+            if isinstance(initializer, int): # In 2.6, (int, long)
+                initializer = [0]*initializer
+            elif isinstance(initializer, basestring):
+                if isinstance(initializer, unicode): # In 3.0, always
                    if encoding is None:
                        # In 3.0, raise TypeError("explicit encoding required")
                        encoding = sys.getdefaultencoding()
-                    initialiser = initialiser.encode(encoding)
-                initialiser = [ord(c) for c in initialiser]
+                    initializer = initializer.encode(encoding)
+                initializer = [ord(c) for c in initializer]
            else:
                if encoding is not None:
-                    raise TypeError("explicit encoding invalid for non-string "
-                                    "initialiser")
-            # Create bytes object and fill with integers from initialiser
-            # while ensuring each integer is in range(256); initialiser
-            # can be any iterable
-            return bytes object
+                    raise TypeError("no encoding allowed for this initializer")
+                tmp = []
+                for c in initializer:
+                    if not isinstance(c, int):
+                        raise TypeError("initializer must be iterable of ints")
+                    if not 0 <= c < 256:
+                        raise ValueError("initializer element out of range")
+                    tmp.append(c)
+                initializer = tmp
+            new = <new bytes object of length len(initializer)>
+            for i, c in enumerate(initializer):
+                new[i] = c
+            return new

    The .__repr__() method returns a string that can be evaluated to
    generate a new bytes object containing the same sequence of
@@ -76,13 +92,10 @@ Specification
        'bytes([0x0a, 0x14, 0x1e])'

    The object has a .decode() method equivalent to the .decode()
-    method of the str object.  (This is redundant since it can also be
-    decoded by calling unicode(b, <encoding>) (in 2.6) or str(b,
-    <encoding>) (in 3.0); do we need encode/decode methods?  In a
-    sense the spelling using a constructor is cleaner.)  The object
-    has a classmethod .fromhex() that takes a string of characters
-    from the set [0-9a-zA-Z ] and returns a bytes object (similar to
-    binascii.unhexlify).  For example:
+    method of the str object.  The object has a classmethod .fromhex()
+    that takes a string of characters from the set [0-9a-fA-F ] and
+    returns a bytes object (similar to binascii.unhexlify).  For
+    example:

        >>> bytes.fromhex('5c5350ff')
        bytes([92, 83, 80, 255])
@@ -96,102 +109,118 @@ Specification
        '5c5350ff'

    The bytes object has some methods similar to list methods, and
-    others similar to str methods:
+    others similar to str methods.  Here is a complete list of
+    methods, with their approximate signatures:

-        __add__
-        __contains__ (with int arg, like list; with bytes arg, like str)
-        __delitem__
-        __delslice__
-        __eq__
-        __ge__
-        __getitem__
-        __getslice__
-        __gt__
-        __iadd__
-        __imul__
-        __iter__
-        __le__
-        __len__
-        __lt__
-        __mul__
-        __ne__
-        __reduce__
-        __reduce_ex__
-        __repr__
-        __reversed__
-        __rmul__
-        __setitem__
-        __setslice__
-        append
-        count
-        decode
-        endswith
-        extend
-        find
-        index
-        insert
-        join
-        partition
-        pop
-        remove
-        replace
-        rindex
-        rpartition
-        split
-        startswith
-        reverse
-        rfind
-        rindex
-        rsplit
-        translate
+        .__add__(bytes) -> bytes
+        .__contains__(int | bytes) -> bool
+        .__delitem__(int | slice) -> None
+        .__delslice__(int, int) -> None
+        .__eq__(bytes) -> bool
+        .__ge__(bytes) -> bool
+        .__getitem__(int | slice) -> int | bytes
+        .__getslice__(int, int) -> bytes
+        .__gt__(bytes) -> bool
+        .__iadd__(bytes) -> bytes
+        .__imul__(int) -> bytes
+        .__iter__() -> iterator
+        .__le__(bytes) -> bool
+        .__len__() -> int
+        .__lt__(bytes) -> bool
+        .__mul__(int) -> bytes
+        .__ne__(bytes) -> bool
+        .__reduce__(...) -> ...
+        .__reduce_ex__(...) -> ...
+        .__repr__() -> str
+        .__reversed__() -> bytes
+        .__rmul__(int) -> bytes
+        .__setitem__(int | slice, int | iterable[int]) -> None
+        .__setslice__(int, int, iterable[int]) -> None
+        .append(int) -> None
+        .count(int) -> int
+        .decode(str) -> str | unicode # in 3.0, only str
+        .endswith(bytes) -> bool
+        .extend(iterable[int]) -> None
+        .find(bytes) -> int
+        .index(bytes | int) -> int
+        .insert(int, int) -> None
+        .join(iterable[bytes]) -> bytes
+        .partition(bytes) -> (bytes, bytes, bytes)
+        .pop([int]) -> int
+        .remove(int) -> None
+        .replace(bytes, bytes) -> bytes
+        .rindex(bytes | int) -> int
+        .rpartition(bytes) -> (bytes, bytes, bytes)
+        .split(bytes) -> list[bytes]
+        .startswith(bytes) -> bool
+        .reverse() -> None
+        .rfind(bytes) -> int
+        .rindex(bytes | int) -> int
+        .rsplit(bytes) -> list[bytes]
+        .translate(bytes, [bytes]) -> bytes

    Note the conspicuous absence of .isupper(), .upper(), and friends.
-    There is no __hash__ because the object is mutable.  There is no
-    usecase for a .sort() method.
+    (But see "Open Issues" below.)  There is no .__hash__() because
+    the object is mutable.  There is no use case for a .sort() method.

-    The bytes also supports the buffer interface, supporting reading
-    and writing binary (but not character) data.
+    The bytes type also supports the buffer interface, supporting
+    reading and writing binary (but not character) data.


-Out of scope issues
+Out of Scope Issues

-    * If we provide a literal syntax for bytes then it should look
-      distinctly different than the syntax for literal strings.  Also, a
-      new type, even built-in, is much less drastic than a new literal
-      (which requires lexer and parser support in addition to everything
-      else).  Since there appears to be no immediate need for a literal
-      representation, designing and implementing one is out of the scope
-      of this PEP.  (Hmm...  A b"..." literal accepting only ASCII
-      values is likely to be added to 3.0; not clear about 2.6.  This
-      needs a PEP.)
+    * Python 3k will have a much different I/O subsystem.  Deciding
+      how that I/O subsystem will work and interact with the bytes
+      object is out of the scope of this PEP.  The expectation however
+      is that binary I/O will read and write bytes, while text I/O
+      will read strings.  Since the bytes type supports the buffer
+      interface, the existing binary I/O operations in Python 2.6 will
+      support bytes objects.

-    * Python 3k will have a much different I/O subsystem.  Deciding how
-      that I/O subsystem will work and interact with the bytes object is
-      out of the scope of this PEP.
-
-    * It has been suggested that a special method named __bytes__ be
-      added to language to allow objects to be converted into byte
+    * It has been suggested that a special method named .__bytes__()
+      be added to the language to allow objects to be converted into byte
      arrays.  This decision is out of scope.


-Unresolved issues
+Open Issues

-    * Need to specify the methods more carefully.  
+    * The .decode() method is redundant since a bytes object b can
+      also be decoded by calling unicode(b, <encoding>) (in 2.6) or
+      str(b, <encoding>) (in 3.0).  Do we need encode/decode methods
+      at all?  In a sense the spelling using a constructor is cleaner.
+
+    * Need to specify the methods still more carefully.
+
+    * Pickling and marshalling support need to be specified.

    * Should all those list methods really be implemented?

+    * There is growing support for a b"..." literal.  Here's a brief
+      spec.  Each invocation of b"..." produces a new bytes object
+      (this is unlike "..." but similar to [...] and {...}).  Inside
+      the literal, only ASCII characters and non-Unicode backslash
+      escapes are allowed; non-ASCII characters not specified as
+      escapes are rejected by the compiler regardless of the source
+      encoding.  The resulting object's value is the same as if
+      bytes(map(ord, "...")) were called.
+
    * A case could be made for supporting .ljust(), .rjust(),
      .center() with a mandatory second argument.

    * A case could be made for supporting .split() with a mandatory
      argument.

-    * How should pickling and marshalling work?
-
-    * I probably forgot a few things.
+    * A case could even be made for supporting .islower(), .isupper(),
+      .isspace(), .isalpha(), .isalnum(), .isdigit() and the
+      corresponding conversions (.lower() etc.), using the ASCII
+      definitions for letters, digits and whitespace.  If this is
+      accepted, the cases for .ljust(), .rjust(), .center() and
+      .split() become much stronger, and they should have default
+      arguments as well, using an ASCII space or all ASCII whitespace
+      (for .split()).


-Questions and answers
+Frequently Asked Questions

    Q: Why have the optional encoding argument when the encode method of
       Unicode objects does the same thing?
@@ -209,7 +238,7 @@ Questions and answers
       in which case you have to use the former.


-    Q: Why does bytes ignore the encoding argument if the initialiser is
+    Q: Why does bytes ignore the encoding argument if the initializer is
       a str?  (This only applies to 2.6.)

    A: There is no sane meaning that the encoding can have in that case.
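
Note on the constructor pseudo-code in this commit: the following is a rough, runnable sketch of the same logic for anyone who wants to try the semantics out. It rests on several assumptions that are not in the PEP text. It targets modern Python 3, it uses the built-in bytearray as a stand-in for the proposed mutable bytes type, and the name make_bytes is hypothetical. It follows the "In 3.0" comments in the pseudo-code, so a text initializer without an encoding raises TypeError instead of falling back to the default encoding.

```python
# Hypothetical helper, not part of the PEP: mirrors the constructor
# pseudo-code from the diff, with bytearray standing in for the
# proposed mutable bytes type (an assumption made for illustration).
def make_bytes(initializer=0, encoding=None):
    if isinstance(initializer, int):
        # A single integer n produces n zero-valued elements.
        initializer = [0] * initializer
    elif isinstance(initializer, str):
        # Python 3.0 branch of the pseudo-code: text must be encoded
        # explicitly before it can become a sequence of integers.
        if encoding is None:
            raise TypeError("explicit encoding required")
        initializer = list(initializer.encode(encoding))
    else:
        if encoding is not None:
            raise TypeError("no encoding allowed for this initializer")
        tmp = []
        for c in initializer:
            if not isinstance(c, int):
                raise TypeError("initializer must be iterable of ints")
            if not 0 <= c < 256:
                raise ValueError("initializer element out of range")
            tmp.append(c)
        initializer = tmp
    # Allocate a zero-filled object of the right length, then fill it,
    # as the last four lines of the pseudo-code do.
    new = bytearray(len(initializer))
    for i, c in enumerate(initializer):
        new[i] = c
    return new


if __name__ == "__main__":
    print(list(make_bytes(3)))
    print(list(make_bytes([92, 83, 80, 255])))
    print(list(make_bytes("abc", "ascii")))
```

The element-by-element fill at the end is deliberately slow and literal; as the diff itself says, the pseudo-code is optimized for clear semantics, not for speed.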