PEP: 456
Title: Pluggable and secure hash algorithm
Version: $Revision$
Last-Modified: $Date$
Author: Christian Heimes <christian@python.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 27-Sep-2013
Python-Version: 3.4
Post-History:


Abstract
========

This PEP proposes SipHash as the default string and bytes hash algorithm to
properly fix hash randomization once and for all. It also proposes an addition
to Python's C API in order to make the hash code pluggable. The new API allows
the algorithm to be selected at startup and more hash algorithms to be added.


Rationale
=========

Despite the last attempt [issue13703]_, CPython is still vulnerable to hash
collision DoS attacks [29c3]_ [issue14621]_. The current hash algorithm and
its randomization are not resilient against attacks. Only a proper
cryptographic hash function prevents the extraction of secret randomization
keys. Although no practical attack against a Python-based service has been
seen yet, the weakness has to be fixed. Jean-Philippe Aumasson and Daniel
J. Bernstein have already shown how the seed for the current implementation
can be recovered [poc]_.

Furthermore, the current hash algorithm is hard-coded and implemented multiple
times for bytes and three different Unicode representations: UCS1, UCS2 and
UCS4. This makes it impossible for embedders to replace it with a different
implementation without patching and recompiling large parts of the interpreter.
Embedders may want to choose a more suitable hash function.

Finally, the current implementation does not perform well. In the common
case it only processes one or two bytes per cycle. On a modern 64-bit processor
the code can easily be adjusted to deal with eight bytes at once.

This PEP proposes three major changes to the hash code for strings and bytes:

* SipHash [sip]_ is introduced as the default hash algorithm. It is fast and
  small despite its cryptographic properties. Because it was designed by
  well-known security and cryptography experts, it is reasonable to assume
  that it will remain secure for the near future.

* Calculation of the hash of strings and bytes is moved into a single API
  function instead of multiple specialized implementations in
  ``Objects/object.c`` and ``Objects/unicodeobject.c``. The function takes a
  void pointer plus length and returns the hash for it.

* The algorithm can be selected by the user with an environment variable,
  command line argument or by an embedder with an API function. By default FNV
  and SipHash are available for selection.


Current implementation with modified FNV
========================================

CPython currently uses a variant of the Fowler-Noll-Vo hash function
[fnv]_. The variant has been modified to reduce the number and cost of hash
collisions for common strings. The first character of the string is added
twice, the first time with a bit shift of 7. The length of the input
string is XOR-ed into the final value. Both deviations from the original FNV
algorithm reduce the number of hash collisions for short strings.

Recently [issue13703]_ a random prefix and suffix were added as an attempt to
randomize the hash values. In order to protect the hash secret the code still
returns ``0`` for zero length input.

C code::

    Py_uhash_t x;
    Py_ssize_t i, len;
    /* p is a pointer to a 1, 2 or 4 byte type; exactly one of these
       declarations applies, depending on the input */
    unsigned char *p;
    Py_UCS2 *p;
    Py_UCS4 *p;

    if (len == 0)
        return 0;
    x = (Py_uhash_t) _Py_HashSecret.prefix;
    x ^= (Py_uhash_t) *p << 7;
    for (i = 0; i < len; i++)
        x = (1000003 * x) ^ (Py_uhash_t) *p++;
    x ^= (Py_uhash_t) len;
    x ^= (Py_uhash_t) _Py_HashSecret.suffix;
    /* -1 is reserved as an error indicator */
    if (x == -1)
        x = -2;
    return x;


Which roughly translates to Python::

    import sys

    def fnv(p):
        if len(p) == 0:
            return 0

        # bit mask, 2**32-1 or 2**64-1
        mask = 2 * sys.maxsize + 1

        x = hashsecret.prefix
        x = (x ^ (ord(p[0]) << 7)) & mask
        for c in p:
            x = ((1000003 * x) ^ ord(c)) & mask
        x = (x ^ len(p)) & mask
        x = (x ^ hashsecret.suffix) & mask

        # reinterpret the unsigned value as a signed Py_hash_t
        if x > sys.maxsize:
            x -= mask + 1

        # -1 is reserved as an error indicator
        if x == -1:
            x = -2

        return x


FNV is a simple multiply and XOR algorithm with no cryptographic properties.
The randomization was not part of the initial hash code, but was added as a
countermeasure against hash collision attacks as explained in oCERT-2011-003
[ocert]_. Because FNV is not a cryptographic hash algorithm and the dict
implementation is not fortified against side channel analysis, the
randomization secrets can be calculated by a remote attacker. The author of
this PEP strongly believes that the nature of a non-cryptographic hash
function makes it impossible to conceal the secrets.
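
One structural property of the scheme above can be checked directly: the
suffix is only XOR-ed into the final value, so the XOR difference of two hash
values, and therefore whether two inputs collide, does not depend on the
suffix at all. The following sketch models the hash on byte strings with a
fixed 64-bit width. ``modified_fnv`` and its ``prefix`` and ``suffix``
parameters are illustrative stand-ins for the C code and ``_Py_HashSecret``,
not an interpreter API, and the special handling of ``-1`` is left out::

    MASK = 2**64 - 1  # assumption: model a 64-bit Py_uhash_t


    def modified_fnv(data, prefix, suffix):
        """Unsigned model of the modified FNV hash for a bytes object."""
        if len(data) == 0:
            return 0
        x = prefix & MASK
        x ^= data[0] << 7
        for byte in data:
            x = ((1000003 * x) ^ byte) & MASK
        x ^= len(data)
        x ^= suffix & MASK
        return x


    def xor_difference(a, b, prefix, suffix):
        """XOR of the two hash values; the suffix cancels out."""
        return modified_fnv(a, prefix, suffix) ^ modified_fnv(b, prefix, suffix)


    prefix = 0x0123456789ABCDEF  # arbitrary example value, not a real secret
    d1 = xor_difference(b"spam", b"eggs", prefix, suffix=0)
    d2 = xor_difference(b"spam", b"eggs", prefix, suffix=0xDEADBEEFCAFEF00D)
    assert d1 == d2

The assertion holds for any pair of non-empty inputs. In this model only the
prefix influences which inputs collide.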


Hash algorithm
==============

SipHash
-------

SipHash [sip]_ is a cryptographic pseudo random function with a 128bit seed and
64bit output. It was designed by Jean-Philippe Aumasson and Daniel J.
Bernstein as a fast and secure keyed hash algorithm. It's used by Ruby, Perl,
OpenDNS, Rust, Redis, FreeBSD and more. The C reference implementation has
been released under CC0 license (public domain).

Quote from SipHash's site:

    SipHash is a family of pseudorandom functions (a.k.a. keyed hash
    functions) optimized for speed on short messages. Target applications
    include network traffic authentication and defense against hash-flooding
    DoS attacks.

siphash24 is the recommended variant with the best performance. It uses 2
rounds per message block and 4 finalization rounds.
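
To make the round structure concrete, here is a minimal pure-Python sketch of
SipHash-2-4. It follows the published algorithm description but is meant for
illustration only; it is far slower than a C implementation, and the helper
names ``_rotl`` and ``_sipround`` are chosen for this example::

    import struct

    MASK64 = 2**64 - 1


    def _rotl(x, b):
        """Rotate the 64-bit value x left by b bits."""
        return ((x << b) | (x >> (64 - b))) & MASK64


    def _sipround(v0, v1, v2, v3):
        """One SipRound: add, rotate and XOR on the four state words."""
        v0 = (v0 + v1) & MASK64
        v1 = _rotl(v1, 13) ^ v0
        v0 = _rotl(v0, 32)
        v2 = (v2 + v3) & MASK64
        v3 = _rotl(v3, 16) ^ v2
        v0 = (v0 + v3) & MASK64
        v3 = _rotl(v3, 21) ^ v0
        v2 = (v2 + v1) & MASK64
        v1 = _rotl(v1, 17) ^ v2
        v2 = _rotl(v2, 32)
        return v0, v1, v2, v3


    def siphash24(data, key):
        """Return the 64-bit SipHash-2-4 of bytes data under a 16-byte key."""
        k0, k1 = struct.unpack("<QQ", key)
        v0 = k0 ^ 0x736F6D6570736575
        v1 = k1 ^ 0x646F72616E646F6D
        v2 = k0 ^ 0x6C7967656E657261
        v3 = k1 ^ 0x7465646279746573

        # Pad with zeros to a multiple of 8 bytes; the very last byte
        # carries the message length modulo 256.
        padding = b"\x00" * (7 - len(data) % 8)
        padded = data + padding + bytes([len(data) & 0xFF])

        for (m,) in struct.iter_unpack("<Q", padded):
            v3 ^= m
            for _ in range(2):  # 2 compression rounds per message block
                v0, v1, v2, v3 = _sipround(v0, v1, v2, v3)
            v0 ^= m

        v2 ^= 0xFF
        for _ in range(4):      # 4 finalization rounds
            v0, v1, v2, v3 = _sipround(v0, v1, v2, v3)
        return v0 ^ v1 ^ v2 ^ v3


    # Test vector from the SipHash paper: key 00..0f, message 00..0e.
    assert siphash24(bytes(range(15)), bytes(range(16))) == 0xA129CA6149BE45E5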

Marek Majkowski's C implementation csiphash [csiphash]_::

    uint64_t siphash24(const void *src,
                       unsigned long src_sz,
                       const char k[16]);


MurmurHash
----------

MurmurHash [murmur]_ is a family of non-cryptographic keyed hash functions
developed by Austin Appleby. Murmur3 is the latest, fast variant of
MurmurHash. The C++ reference implementation has been released into the public
domain. It features a 32-bit seed and 32- or 128-bit output.

::

    void MurmurHash3_x86_32(const void *key,
                            int len,
                            uint32_t seed,
                            void *out);

    void MurmurHash3_x86_128(const void *key,
                             int len,
                             uint32_t seed,
                             void *out);

    void MurmurHash3_x64_128(const void *key,
                             int len,
                             uint32_t seed,
                             void *out);


CityHash
--------

CityHash [city]_ is a family of non-cryptographic hash functions developed by
Geoff Pike and Jyrki Alakuijala for Google. The C++ reference implementation
has been released under the MIT license. The algorithm is partly based on
MurmurHash and claims to be faster. It supports 64- and 128-bit output with a
128-bit seed as well as 32-bit output without a seed.

::

    uint64 CityHash64WithSeeds(const char *buf,
                               size_t len,
                               uint64 seed0,
                               uint64 seed1);


C API Implementation
====================

hash secret
-----------

The ``_Py_HashSecret_t`` type of Python 2.6 to 3.3 has two members with either
32- or 64-bit length each. SipHash requires two 64-bit unsigned integers as
keys. The typedef will be changed to a union with a guaranteed size of 128
bits on all architectures. On platforms with a 64-bit data type it will have
two ``uint64`` members. Because C89-compatible compilers may not have
``uint64``, the union also has an array of 16 chars.

new type definition::

    typedef union {
        unsigned char uc16[16];
        struct {
            Py_hash_t prefix;
            Py_hash_t suffix;
        } ht;
    #ifdef PY_UINT64_T
        struct {
            PY_UINT64_T k0;
            PY_UINT64_T k1;
        } ui64;
    #endif
    } _Py_HashSecret_t;

    PyAPI_DATA(_Py_HashSecret_t) _Py_HashSecret;

``_Py_HashSecret`` is initialized in ``Python/random.c:_PyRandom_Init()``
exactly once at startup.
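
The members of the union are different views of the same 16 bytes. The
following sketch illustrates this with the ``struct`` module. It assumes a
little-endian platform with a 64-bit ``Py_hash_t``; the function
``secret_views`` and the fixed example secret are hypothetical and only serve
the illustration::

    import struct


    def secret_views(secret):
        """Return the three views of a 16-byte hash secret as a dict."""
        if len(secret) != 16:
            raise ValueError("the hash secret must be exactly 16 bytes")
        # ui64 view: two unsigned 64-bit keys as required by SipHash
        k0, k1 = struct.unpack("<QQ", secret)
        # ht view: signed prefix and suffix as used by the FNV code
        prefix, suffix = struct.unpack("<qq", secret)
        return {
            "uc16": bytes(secret),
            "ui64": (k0, k1),
            "ht": (prefix, suffix),
        }


    views = secret_views(bytes(range(16)))

All three entries describe the same memory, so packing ``ht`` or ``ui64``
back with the matching format string reproduces ``uc16``.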


hash function table
-------------------

type definition::

    typedef Py_hash_t (*PyHash_func_t)(void *, Py_ssize_t);

    typedef struct {
        PyHash_func_t hashfunc;
        char *name;
        unsigned int precedence;
    } PyHash_FuncDef;

    PyAPI_DATA(PyHash_FuncDef) *PyHash_FuncTable;

Implementation::

    PyHash_FuncDef hash_func_table[] = {
        {fnv, "fnv", 10},
    #ifdef PY_UINT64_T
        {siphash24, "sip24", 20},
    #endif
        {NULL, NULL},
    };

    PyHash_FuncDef *PyHash_FuncTable = hash_func_table;


hash function API
-----------------

::

    int PyHash_SetHashAlgorithm(char *name);

    PyHash_FuncDef* PyHash_GetHashAlgorithm(void);

``PyHash_SetHashAlgorithm(NULL)`` selects the hash algorithm with the highest
precedence. ``PyHash_SetHashAlgorithm("sip24")`` selects siphash24 as hash
algorithm. The function returns ``0`` on success. In case the algorithm is
not supported or a hash algorithm is already set it returns ``-1``.
(XXX use enum?)

``PyHash_GetHashAlgorithm()`` returns a pointer to the current hash function
definition or ``NULL``.
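
The selection rules can be modelled in a few lines of Python. This is only a
sketch of the semantics described above, not the proposed C implementation.
``HashConfig`` and its method names are hypothetical, and the table reuses the
names and precedence values from the implementation listing::

    # (name, precedence) pairs mirroring hash_func_table
    HASH_FUNC_TABLE = [("fnv", 10), ("sip24", 20)]


    class HashConfig:
        """Model of PyHash_SetHashAlgorithm / PyHash_GetHashAlgorithm."""

        def __init__(self, table):
            self._table = list(table)
            self._current = None

        def set_hash_algorithm(self, name=None):
            """Return 0 on success, -1 if unsupported or already set."""
            if self._current is not None:
                return -1
            if name is None:
                candidates = self._table
            else:
                candidates = [e for e in self._table if e[0] == name]
            if not candidates:
                return -1
            # With name=None this picks the highest precedence entry.
            self._current = max(candidates, key=lambda entry: entry[1])
            return 0

        def get_hash_algorithm(self):
            """Return the selected (name, precedence) pair or None."""
            return self._current


    config = HashConfig(HASH_FUNC_TABLE)
    status = config.set_hash_algorithm(None)

In this model a second call to ``set_hash_algorithm()`` fails with ``-1``
regardless of its argument, matching the rule that the algorithm can only be
set once.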

(XXX use an extern variable to hold a function pointer to improve performance?)


Python API
==========

sys module
----------

The sys module gains a new struct with information about the selected
algorithm as well as all available algorithms.

::

    sys.hash_info(algorithm='siphash24', available=('siphash24', 'fnv'))


_testcapi
---------

The ``_testcapi`` C module gets a function to hash a buffer or string object
with any supported hash algorithm. The function neither uses nor sets the
cached hash value of the object. The feature is solely intended for benchmarks
and testing.

::

    _testcapi.get_hash(name: str, str_or_buffer) -> int


Further things to consider
==========================

ASCII str / bytes hash collision
--------------------------------

Since the implementation of PEP 393 [#pep-0393]_, bytes and ASCII text have
the same memory layout. Because of this the new hashing API will keep the
invariant::

    hash("ascii string") == hash(b"ascii string")

for ASCII strings and ASCII bytes. Equal hash values result in a hash collision
and therefore cause a minor speed penalty for dicts and sets with mixed keys.
The cause of the collision could be removed by, e.g., adding ``-2`` to
the hash value of bytes. (``-2`` because ``hash(b"") == 0`` and ``-1`` is
reserved.)
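
A minimal sketch of such an adjustment, assuming a signed hash value of
``bits`` width that wraps around on overflow. The function name
``separate_bytes_hash`` is hypothetical and the PEP does not specify this
behaviour::

    def separate_bytes_hash(h, bits=64):
        """Shift a bytes hash by -2 within the signed range, avoiding -1."""
        modulus = 1 << bits
        h = (h - 2) % modulus      # wrap around in unsigned arithmetic
        if h >= modulus >> 1:      # convert back to a signed value
            h -= modulus
        # -1 stays reserved, so it is mapped to -2 like everywhere else
        return -2 if h == -1 else h

With this model the empty bytes object moves from ``0`` to ``-2``. An
original value of ``1`` would also end up at ``-2`` because ``-1`` has to be
avoided.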


Performance
===========

TBD

First tests suggest that SipHash performs a bit faster on 64-bit CPUs when
it is fed with medium size byte strings as well as ASCII and UCS2 Unicode
strings. For very short strings the setup costs for SipHash dominate its
speed but it is still in the same order of magnitude as the current FNV code.

Serhiy Storchaka has shown in [issue16427]_ that a modified FNV
implementation with 64bits per cycle is able to process long strings several
times faster than the current FNV implementation.


Backwards Compatibility
=======================

The modifications don't alter any existing API.

The output of ``hash()`` for strings and bytes is going to be different. The
hash values for ASCII Unicode and ASCII bytes will stay equal.


Alternative counter measures against hash collision DoS
=======================================================

Three alternative countermeasures against hash collisions were discussed in
the past, but are not the subject of this PEP.

1. Marc-Andre Lemburg has suggested that dicts shall count hash collisions. In
   case an insert operation causes too many collisions an exception shall be
   raised.

2. Some applications (e.g. PHP) limit the number of keys for GET and POST
   HTTP requests. The approach effectively reduces the impact of a hash
   collision attack. (XXX citation needed)

3. Hash maps have a worst case of O(n) for insertion and lookup of keys. This
   results in a quadratic runtime during a hash collision attack. The
   introduction of a new and additional data structure with O(log n)
   worst case behavior would eliminate the root cause. Data structures like
   red-black trees or prefix trees (trie [trie]_) would have other benefits,
   too. Prefix trees with string keys can reduce memory usage as common
   prefixes are stored within the tree structure.
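
The first countermeasure can be illustrated with a toy hash table that
refuses an insert once a bucket holds too many colliding keys. This is a
sketch only: CPython's dict uses open addressing rather than chained buckets,
and ``LimitedDict``, ``TooManyCollisions`` and the limits used here are
hypothetical::

    class TooManyCollisions(Exception):
        """Raised when an insert would exceed the collision limit."""


    class LimitedDict:
        """Toy chained hash table with a per-bucket collision limit."""

        def __init__(self, buckets=8, max_collisions=4):
            self._buckets = [[] for _ in range(buckets)]
            self._max_collisions = max_collisions

        def _bucket(self, key):
            return self._buckets[hash(key) % len(self._buckets)]

        def __setitem__(self, key, value):
            bucket = self._bucket(key)
            for index, (existing, _) in enumerate(bucket):
                if existing == key:
                    bucket[index] = (key, value)  # update in place
                    return
            if len(bucket) >= self._max_collisions:
                raise TooManyCollisions(
                    "more than %d keys in one bucket" % self._max_collisions)
            bucket.append((key, value))

        def __getitem__(self, key):
            for existing, value in self._bucket(key):
                if existing == key:
                    return value
            raise KeyError(key)

With eight buckets, the small integers ``0``, ``8``, ``16`` and ``24`` all
land in the same bucket in CPython, so inserting ``32`` as a fifth colliding
key raises ``TooManyCollisions``.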


Reference
=========

.. [29c3] http://events.ccc.de/congress/2012/Fahrplan/events/5152.en.html

.. [fnv] http://en.wikipedia.org/wiki/Fowler-Noll-Vo_hash_function

.. [sip] https://131002.net/siphash/

.. [ocert] http://www.nruns.com/_downloads/advisory28122011.pdf

.. [poc] https://131002.net/siphash/poc.py

.. [issue13703] http://bugs.python.org/issue13703

.. [issue14621] http://bugs.python.org/issue14621

.. [issue16427] http://bugs.python.org/issue16427

.. [trie] http://en.wikipedia.org/wiki/Trie

.. [city] http://code.google.com/p/cityhash/

.. [murmur] http://code.google.com/p/smhasher/

.. [csiphash] https://github.com/majek/csiphash/

.. [#pep-0393] http://www.python.org/dev/peps/pep-0393/


Copyright
=========

This document has been placed in the public domain.


..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: