From 71b23a63200b6861e0430fc191a68b3e1d93094b Mon Sep 17 00:00:00 2001 From: "Phillip J. Eby" Date: Sat, 25 Sep 2010 19:44:55 +0000 Subject: [PATCH] WSGI is now Python 3-friendly. This does not cover the other planned addenda/errata, and it may need more work even on these bits, but it is now begun. (Many thanks to Graham and Ian.) --- pep-0333.txt | 143 ++++++++++++++++++++++++++++++++++++--------------- 1 file changed, 101 insertions(+), 42 deletions(-) diff --git a/pep-0333.txt b/pep-0333.txt index 2547fca1f..6fc4644d3 100644 --- a/pep-0333.txt +++ b/pep-0333.txt @@ -142,6 +142,51 @@ callable was provided to it. Callables are only to be called, not introspected upon. +A Note On String Types +---------------------- + +In general, HTTP deals with bytes, which means that this specification +is mostly about handling bytes. + +However, the content of those bytes often has some kind of textual +interpretation, and in Python, strings are the most convenient way +to handle text. + +But in many Python versions and implementations, strings are Unicode, +rather than bytes. This requires a careful balance between a usable +API and correct translations between bytes and text in the context of +HTTP... especially to support porting code between Python +implementations with different ``str`` types. + +WSGI therefore defines two kinds of "string": + +* "Native" strings (which are always implemented using the type + named ``str``) that are used for request/response headers and + metadata + +* "Bytestrings" (which are implemented using the ``bytes`` type + in Python 3, and ``str`` elsewhere), that are used for the bodies + of requests and responses (e.g. POST/PUT input data and HTML page + outputs). + +Do not be confused however: even if Python's ``str`` type is actually +Unicode "under the hood", the *content* of native strings must +still be translatable to bytes via the Latin-1 encoding! (See +the section on `Unicode Issues`_ later in this document for more +details.) + +In short: where you see the word "string" in this document, it refers +to a "native" string, i.e., an object of type ``str``, whether it is +internally implemented as bytes or unicode. Where you see references +to "bytestring", this should be read as "an object of type ``bytes`` +under Python 3, or type ``str`` under Python 2". + +And so, even though HTTP is in some sense "really just bytes", there +are many API conveniences to be had by using whatever Python's +default ``str`` type is. + + + The Application/Framework Side ------------------------------ @@ -164,13 +209,15 @@ support application developers.) Here are two example application objects; one is a function, and the other is a class:: + # this would need to be a byte string in Python 3: + HELLO_WORLD = "Hello world!\n" + def simple_app(environ, start_response): """Simplest possible application object""" status = '200 OK' response_headers = [('Content-type', 'text/plain')] start_response(status, response_headers) - return ['Hello world!\n'] - + return [HELLO_WORLD] class AppClass: """Produce the same output, but using a class @@ -195,7 +242,7 @@ other is a class:: status = '200 OK' response_headers = [('Content-type', 'text/plain')] self.start(status, response_headers) - yield "Hello world!\n" + yield HELLO_WORLD The Server/Gateway Side @@ -243,7 +290,7 @@ server. sys.stdout.write('%s: %s\r\n' % header) sys.stdout.write('\r\n') - sys.stdout.write(data) + sys.stdout.write(data) # TODO: this needs to be binary on Py3 sys.stdout.flush() def start_response(status, response_headers, exc_info=None): @@ -326,7 +373,7 @@ a block boundary.) """Transform iterated output to piglatin, if it's okay to do so Note that the "okayness" can change until the application yields - its first non-empty string, so 'transform_ok' has to be a mutable + its first non-empty bytestring, so 'transform_ok' has to be a mutable truth value. """ @@ -341,7 +388,7 @@ a block boundary.) def next(self): if self.transform_ok: - return piglatin(self._next()) + return piglatin(self._next()) # call must be byte-safe on Py3 else: return self._next() @@ -376,7 +423,7 @@ a block boundary.) if transform_ok: def write_latin(data): - write(piglatin(data)) + write(piglatin(data)) # call must be byte-safe on Py3 return write_latin else: return write @@ -426,7 +473,7 @@ It is used only when the application has trapped an error and is attempting to display an error message to the browser. The ``start_response`` callable must return a ``write(body_data)`` -callable that takes one positional parameter: a string to be written +callable that takes one positional parameter: a bytestring to be written as part of the HTTP response body. (Note: the ``write()`` callable is provided only to support certain existing frameworks' imperative output APIs; it should not be used by new applications or frameworks if it @@ -434,24 +481,24 @@ can be avoided. See the `Buffering and Streaming`_ section for more details.) When called by the server, the application object must return an -iterable yielding zero or more strings. This can be accomplished in a -variety of ways, such as by returning a list of strings, or by the -application being a generator function that yields strings, or +iterable yielding zero or more bytestrings. This can be accomplished in a +variety of ways, such as by returning a list of bytestrings, or by the +application being a generator function that yields bytestrings, or by the application being a class whose instances are iterable. Regardless of how it is accomplished, the application object must -always return an iterable yielding zero or more strings. +always return an iterable yielding zero or more bytestrings. -The server or gateway must transmit the yielded strings to the client -in an unbuffered fashion, completing the transmission of each string +The server or gateway must transmit the yielded bytestrings to the client +in an unbuffered fashion, completing the transmission of each bytestring before requesting another one. (In other words, applications **should** perform their own buffering. See the `Buffering and Streaming`_ section below for more on how application output must be handled.) -The server or gateway should treat the yielded strings as binary byte +The server or gateway should treat the yielded bytestrings as binary byte sequences: in particular, it should ensure that line endings are not altered. The application is responsible for ensuring that the -string(s) to be written are in a format suitable for the client. (The +bytestring(s) to be written are in a format suitable for the client. (The server or gateway **may** apply HTTP transfer encodings, or perform other transformations for the purpose of implementing HTTP features such as byte-range transmission. See `Other HTTP Features`_, below, @@ -472,7 +519,7 @@ by the application. This protocol is intended to complement PEP 325's generator support, and other common iterables with ``close()`` methods. (Note: the application **must** invoke the ``start_response()`` -callable before the iterable yields its first body string, so that the +callable before the iterable yields its first body bytestring, so that the server can send the headers before any body content. However, this invocation **may** be performed by the iterable's first iteration, so servers **must not** assume that ``start_response()`` has been called @@ -565,7 +612,7 @@ have a fallback plan in the event such a variable is absent. Note: missing variables (such as ``REMOTE_USER`` when no authentication has occurred) should be left out of the ``environ`` -dictionary. Also note that CGI-defined variables must be strings, +dictionary. Also note that CGI-defined variables must be native strings, if they are present at all. It is a violation of this specification for a CGI variable's value to be of any type other than ``str``. @@ -585,9 +632,9 @@ Variable Value ``"http"`` or ``"https"``, as appropriate. ``wsgi.input`` An input stream (file-like object) from which - the HTTP request body can be read. (The server - or gateway may perform reads on-demand as - requested by the application, or it may pre- + the HTTP request body bytes can be read. (The + server or gateway may perform reads on-demand + as requested by the application, or it may pre- read the client's request body and buffer it in-memory or on disk, or use any other technique for providing such an input stream, @@ -602,6 +649,12 @@ Variable Value ending, and assume that it will be converted to the correct line ending by the server/gateway. + (On platforms where the ``str`` type is unicode, + the error stream **should** accept and log + arbitary unicode without raising an error; it + is allowed, however, to substitute characters + that cannot be rendered in the stream's encoding.) + For many servers, ``wsgi.errors`` will be the server's main error log. Alternatively, this may be ``sys.stderr``, or a log file of some @@ -745,7 +798,7 @@ headers, please see the `Other HTTP Features`_ section below.) The ``start_response`` callable **must not** actually transmit the response headers. Instead, it must store them for the server or gateway to transmit **only** after the first iteration of the -application return value that yields a non-empty string, or upon +application return value that yields a non-empty bytestring, or upon the application's first invocation of the ``write()`` callable. In other words, response headers must not be sent until there is actual body data available, or until the application's returned iterable is @@ -820,12 +873,12 @@ able to either generate a ``Content-Length`` header, or at least avoid the need to close the client connection. If the application does *not* call the ``write()`` callable, and returns an iterable whose ``len()`` is 1, then the server can automatically determine -``Content-Length`` by taking the length of the first string yielded +``Content-Length`` by taking the length of the first bytestring yielded by the iterable. And, if the server and client both support HTTP/1.1 "chunked encoding" [3]_, then the server **may** use chunked encoding to send -a chunk for each ``write()`` call or string yielded by the iterable, +a chunk for each ``write()`` call or bytestring yielded by the iterable, thus generating a ``Content-Length`` header for each chunk. This allows the server to keep the client connection alive, if it wishes to do so. Note that the server **must** comply fully with RFC 2616 @@ -850,7 +903,7 @@ transmitted all at once, along with the response headers. The corresponding approach in WSGI is for the application to simply return a single-element iterable (such as a list) containing the -response body as a single string. This is the recommended approach +response body as a single bytestring. This is the recommended approach for the vast majority of application functions, that render HTML pages whose text easily fits in memory. @@ -899,12 +952,12 @@ In order to better support asynchronous applications and servers, middleware components **must not** block iteration waiting for multiple values from an application iterable. If the middleware needs to accumulate more data from the application before it can -produce any output, it **must** yield an empty string. +produce any output, it **must** yield an empty bytestring. To put this requirement another way, a middleware component **must yield at least one value** each time its underlying application yields a value. If the middleware cannot yield any other value, -it must yield an empty string. +it must yield an empty bytestring. This requirement ensures that asynchronous applications and servers can conspire to reduce the number of threads that are required @@ -946,22 +999,22 @@ for web servers to interleave other tasks in the same Python thread, potentially providing better throughput for the server as a whole. The ``write()`` callable is returned by the ``start_response()`` -callable, and it accepts a single parameter: a string to be +callable, and it accepts a single parameter: a bytestring to be written as part of the HTTP response body, that is treated exactly as though it had been yielded by the output iterable. In other words, before ``write()`` returns, it must guarantee that the -passed-in string was either completely sent to the client, or +passed-in bytestring was either completely sent to the client, or that it is buffered for transmission while the application proceeds onward. An application **must** return an iterable object, even if it uses ``write()`` to produce all or part of its response body. The returned iterable **may** be empty (i.e. yield no non-empty -strings), but if it *does* yield non-empty strings, that output +bytestrings), but if it *does* yield non-empty bytestrings, that output must be treated normally by the server or gateway (i.e., it must be sent or queued immediately). Applications **must not** invoke ``write()`` from within their return iterable, and therefore any -strings yielded by the iterable are transmitted after all strings +bytestrings yielded by the iterable are transmitted after all bytestrings passed to ``write()`` have been sent to the client. @@ -970,9 +1023,9 @@ Unicode Issues HTTP does not directly support Unicode, and neither does this interface. All encoding/decoding must be handled by the application; -all strings passed to or from the server must be standard Python byte -strings, not Unicode objects. The result of using a Unicode object -where a string object is required, is undefined. +all strings passed to or from the server must be of type ``str`` or +``bytes``, never ``unicode``. The result of using a ``unicode`` +object where a string object is required, is undefined. Note also that strings passed to ``start_response()`` as a status or as response headers **must** follow RFC 2616 with respect to encoding. @@ -980,7 +1033,7 @@ That is, they must either be ISO-8859-1 characters, or use RFC 2047 MIME encoding. On Python platforms where the ``str`` or ``StringType`` type is in -fact Unicode-based (e.g. Jython, IronPython, Python 3000, etc.), all +fact Unicode-based (e.g. Jython, IronPython, Python 3, etc.), all "strings" referred to in this specification must contain only code points representable in ISO-8859-1 encoding (``\u0000`` through ``\u00FF``, inclusive). It is a fatal error for an application to @@ -988,12 +1041,18 @@ supply strings containing any other Unicode character or code point. Similarly, servers and gateways **must not** supply strings to an application containing any other Unicode characters. -Again, all strings referred to in this specification **must** be -of type ``str`` or ``StringType``, and **must not** be of type -``unicode`` or ``UnicodeType``. And, even if a given platform allows -for more than 8 bits per character in ``str``/``StringType`` objects, -only the lower 8 bits may be used, for any value referred to in -this specification as a "string". +Again, all objects referred to in this specification as "strings" +**must** be of type ``str`` or ``StringType``, and **must not** be +of type ``unicode`` or ``UnicodeType``. And, even if a given platform +allows for more than 8 bits per character in ``str``/``StringType`` +objects, only the lower 8 bits may be used, for any value referred +to in this specification as a "string". + +For values referred to in this specification as "bytestrings" +(i.e., values read from ``wsgi.input``, passed to ``write()`` +or yielded by the application), the value **must** be of type +``bytes`` under Python 3, and ``str`` in earlier versions of +Python. Error Handling @@ -1448,7 +1507,7 @@ Questions and Answers ``environ`` dictionary. This is the recommended approach for offering any such value-added services. -2. Why can you call ``write()`` *and* yield strings/return an +2. Why can you call ``write()`` *and* yield bytestrings/return an iterable? Shouldn't we pick just one way? If we supported only the iteration approach, then current