PEP: 456
Title: Secure and interchangeable hash algorithm
Version: $Revision$
Last-Modified: $Date$
Author: Christian Heimes <christian@python.org>
BDFL-Delegate: Nick Coghlan
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 27-Sep-2013
Python-Version: 3.4
Post-History: 06-Oct-2013
Abstract
========
This PEP proposes SipHash as the default string and bytes hash algorithm to properly
fix hash randomization once and for all. It also proposes modifications to
Python's C code in order to unify the hash code and to make it easily
interchangeable.
Rationale
=========
Despite the last attempt [issue13703]_ CPython is still vulnerable to hash
collision DoS attacks [29c3]_ [issue14621]_. The current hash algorithm and
its randomization are not resilient against attacks. Only a proper
cryptographic hash function prevents the extraction of secret randomization
keys. Although no practical attack against a Python-based service has been
seen yet, the weakness has to be fixed. Jean-Philippe Aumasson and Daniel
J. Bernstein have already shown how the seed for the current implementation
can be recovered [poc]_.
Furthermore the current hash algorithm is hard-coded and implemented multiple
times for bytes and three different Unicode representations UCS1, UCS2 and
UCS4. This makes it impossible for embedders to replace it with a different
implementation without patching and recompiling large parts of the interpreter.
Embedders may want to choose a more suitable hash function.
Finally the current implementation code does not perform well. In the common
case it only processes one or two bytes per cycle. On a modern 64-bit processor
the code can easily be adjusted to deal with eight bytes at once.
This PEP proposes the following major changes to the hash code for strings
and bytes:

* SipHash [sip]_ is introduced as the default hash algorithm. It is fast and
  small despite its cryptographic properties. Because it was designed by
  well-known security and crypto experts, it is reasonable to assume that it
  is secure for the near future.

* The existing FNV code is kept for platforms without a 64-bit data type. The
  algorithm is optimized to process larger chunks per cycle.

* Calculation of the hash of strings and bytes is moved into a single API
  function instead of multiple specialized implementations in
  ``Objects/object.c`` and ``Objects/unicodeobject.c``. The function takes a
  void pointer plus length and returns the hash for it.

* The algorithm can be selected at compile time. FNV is guaranteed to exist
  on all platforms. SipHash is available on the majority of modern systems.

Requirements for a hash function
================================
* It MUST be able to hash arbitrarily large blocks of memory from 1 byte up
  to the maximum ``ssize_t`` value.

* It MUST produce at least 32 bits on 32-bit platforms and at least 64 bits
  on 64-bit platforms. (Note: Larger outputs can be compressed with e.g.
  ``v ^ (v >> 32)``.)

* It MUST support hashing of unaligned memory in order to support
  ``hash(memoryview)``.

* It MUST NOT return ``-1``. The value is reserved for error cases and for
  hash values that are not yet cached. (Note: A special case can be added to
  map ``-1`` to ``-2``.)

* It is highly RECOMMENDED that the length of the input influences the
  outcome, so that ``hash(b'\x00') != hash(b'\x00\x00')``.

* It MAY return ``0`` for zero-length input in order to disguise the
  randomization seed. (Note: This can be handled as a special case, too.)

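The notes in the list above (folding a wide result, mapping ``-1`` to ``-2``, returning ``0`` for empty input) could be combined in one small post-processing step. The following is only a minimal sketch of that idea, not CPython's implementation; ``finalize_hash`` and its parameters are hypothetical names chosen for illustration.

```python
def finalize_hash(raw, length, bits=64):
    """Post-process an unsigned 64-bit hash value (hypothetical helper).

    raw    -- unsigned 64-bit output of the underlying hash function
    length -- length of the hashed input in bytes
    bits   -- width of the platform hash type, 32 or 64
    """
    if length == 0:
        # Optional rule: hide the seed for zero-length input.
        return 0
    if bits == 32:
        # Fold the upper half into the lower half: v ^ (v >> 32).
        raw = (raw ^ (raw >> 32)) & 0xFFFFFFFF
    # Reinterpret the unsigned value as a signed integer of the same width.
    if raw >= 1 << (bits - 1):
        raw -= 1 << bits
    # -1 is reserved for errors and uncached hashes; map it to -2.
    return -2 if raw == -1 else raw
```

The wrapper keeps the underlying hash function free of special cases, so the same rules apply regardless of which algorithm is selected.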
Current implementation with modified FNV
========================================
CPython currently uses a variant of the Fowler-Noll-Vo hash function
[fnv]_. The variant has been modified to reduce the number and cost of hash
collisions for common strings. The first character of the string is added
twice, the first time with a bit shift of 7. The length of the input
string is XOR-ed to the final value. Both deviations from the original FNV
algorithm reduce the number of hash collisions for short strings.
Recently [issue13703]_ a random prefix and suffix were added as an attempt to
randomize the hash values. In order to protect the hash secret the code still
returns ``0`` for zero length input.
C code::

    Py_uhash_t x;
    Py_ssize_t len;
    Py_ssize_t i;
    /* p is either 1, 2 or 4 byte type */
    unsigned char *p;
    Py_UCS2 *p;
    Py_UCS4 *p;

    if (len == 0)
        return 0;
    x = (Py_uhash_t) _Py_HashSecret.prefix;
    x ^= (Py_uhash_t) *p << 7;
    for (i = 0; i < len; i++)
        x = (1000003 * x) ^ (Py_uhash_t) *p++;
    x ^= (Py_uhash_t) len;
    x ^= (Py_uhash_t) _Py_HashSecret.suffix;
    return x;

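Which roughly translates to Python::

    import sys

    def fnv(p):
        if len(p) == 0:
            return 0

        # bit mask, 2**32-1 or 2**64-1
        mask = 2 * sys.maxsize + 1

        # hashsecret stands for the interpreter's random prefix/suffix pair
        x = hashsecret.prefix
        x = (x ^ (ord(p[0]) << 7)) & mask
        for c in p:
            x = ((1000003 * x) ^ ord(c)) & mask
        x = (x ^ len(p)) & mask
        x = (x ^ hashsecret.suffix) & mask

        # reinterpret the unsigned value as signed, as the C cast does
        if x > sys.maxsize:
            x -= mask + 1
        if x == -1:
            x = -2
        return x
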
FNV is a simple multiply and XOR algorithm with no cryptographic properties.
The randomization was not part of the initial hash code, but was added as a
countermeasure against hash collision attacks as explained in oCERT-2011-003
[ocert]_. Because FNV is not a cryptographic hash algorithm and the dict
implementation is not fortified against side channel analysis, the
randomization secrets can be calculated by a remote attacker. The author of
this PEP strongly believes that the nature of a non-cryptographic hash
function makes it impossible to conceal the secrets.
Examined hashing algorithms
===========================
The author of this PEP has researched several hashing algorithms that are
considered modern, fast and state-of-the-art.
SipHash
-------
SipHash [sip]_ is a cryptographic pseudo random function with a 128-bit seed
and 64-bit output. It was designed by Jean-Philippe Aumasson and Daniel J.
Bernstein as a fast and secure keyed hash algorithm. It's used by Ruby, Perl,
OpenDNS, Rust, Redis, FreeBSD and more. The C reference implementation has
been released under CC0 license (public domain).
Quote from SipHash's site:

    SipHash is a family of pseudorandom functions (a.k.a. keyed hash
    functions) optimized for speed on short messages. Target applications
    include network traffic authentication and defense against hash-flooding
    DoS attacks.

siphash24 is the recommended variant with the best performance. It uses 2
rounds per message block and 4 finalization rounds. Besides the reference
implementation several other implementations are available. Some are
single-shot functions, others use a Merkle-Damgård construction-like approach
with init, update and finalize functions. Marek Majkowski's C implementation
csiphash [csiphash]_ defines the prototype of the function. (Note: ``k`` is
split up into two uint64_t)::

    uint64_t siphash24(const void *src,
                       unsigned long src_sz,
                       const char k[16]);

SipHash requires a 64-bit data type and is not compatible with pure C89
platforms.
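To make the "2 rounds per message block and 4 finalization rounds" structure concrete, here is a pure-Python sketch of SipHash-2-4. It mirrors the single-shot prototype above (data plus a 16-byte key) but is written for readability only; it is an illustration, not the code proposed for CPython, and the helper names are this example's own.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def _rotl(x, b):
    """Rotate the 64-bit value x left by b bits."""
    return ((x << b) | (x >> (64 - b))) & MASK64


def siphash24(data, key):
    """Return the SipHash-2-4 of data (bytes) under a 16-byte key."""
    if len(key) != 16:
        raise ValueError("key must be exactly 16 bytes")
    # The key is split into two little-endian 64-bit halves k0 and k1.
    k0, k1 = struct.unpack("<QQ", key)
    v0 = k0 ^ 0x736F6D6570736575
    v1 = k1 ^ 0x646F72616E646F6D
    v2 = k0 ^ 0x6C7967656E657261
    v3 = k1 ^ 0x7465646279746573

    def sipround():
        nonlocal v0, v1, v2, v3
        v0 = (v0 + v1) & MASK64
        v1 = _rotl(v1, 13) ^ v0
        v0 = _rotl(v0, 32)
        v2 = (v2 + v3) & MASK64
        v3 = _rotl(v3, 16) ^ v2
        v0 = (v0 + v3) & MASK64
        v3 = _rotl(v3, 21) ^ v0
        v2 = (v2 + v1) & MASK64
        v1 = _rotl(v1, 17) ^ v2
        v2 = _rotl(v2, 32)

    # Pad to a multiple of 8 bytes; the last byte carries len(data) mod 256.
    tail = len(data) % 8
    padded = data + b"\x00" * (7 - tail) + bytes([len(data) & 0xFF])
    for (m,) in struct.iter_unpack("<Q", padded):
        v3 ^= m
        sipround()  # two compression rounds per message block
        sipround()
        v0 ^= m

    v2 ^= 0xFF
    for _ in range(4):  # four finalization rounds
        sipround()
    return v0 ^ v1 ^ v2 ^ v3
```

Calling ``siphash24(b"some bytes", bytes(range(16)))`` returns an unsigned 64-bit integer. Because the message length is mixed into the final block, inputs that differ only in trailing zero bytes still produce different internal states.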
MurmurHash
----------
MurmurHash [murmur]_ is a family of non-cryptographic keyed hash functions
developed by Austin Appleby. Murmur3 is the latest variant of MurmurHash and
is fast. The C++ reference implementation has been released into the public
domain. It features 32- or 128-bit output with a 32-bit seed. (Note: The out
parameter is a buffer of either 4 or 16 bytes.)

Murmur3's function prototypes are::

    void MurmurHash3_x86_32(const void *key,
                            int len,
                            uint32_t seed,
                            void *out);

    void MurmurHash3_x86_128(const void *key,
                             int len,
                             uint32_t seed,
                             void *out);

    void MurmurHash3_x64_128(const void *key,
                             int len,
                             uint32_t seed,
                             void *out);

The 128-bit variants require a 64-bit data type and are not compatible with
pure C89 platforms. The 32-bit variant is fully C89-compatible.

Aumasson, Bernstein and Boßlet have shown [sip]_ [ocert-2012-001]_ that
Murmur3 is not resilient against hash collision attacks. Therefore Murmur3
can no longer be considered a secure algorithm. It may still be an
alternative if hash collision attacks are of no concern.

CityHash
--------
CityHash [city]_ is a family of non-cryptographic hash functions developed by
Geoff Pike and Jyrki Alakuijala for Google. The C++ reference implementation
has been released under the MIT license. The algorithm is partly based on
MurmurHash and claims to be faster. It supports 64- and 128-bit output with a
128-bit seed as well as 32-bit output without seed.

The relevant function prototype for 64-bit CityHash with 128-bit seed is::

    uint64 CityHash64WithSeeds(const char *buf,
                               size_t len,
                               uint64 seed0,
                               uint64 seed1);

CityHash also offers SSE 4.2 optimizations with CRC32 intrinsic for long
inputs. All variants except CityHash32 require 64-bit data types. CityHash32
uses only 32-bit data types but it doesn't support seeding.

As with MurmurHash, Aumasson, Bernstein and Boßlet have shown [sip]_ a
similar weakness in CityHash.

HMAC, MD5, SHA-1, SHA-2
-----------------------
These hash algorithms are too slow and have high setup and finalization costs.
For these reasons they are not considered fit for this purpose.
AES CMAC
--------
Modern AMD and Intel CPUs have AES-NI (AES instruction set) [aes-ni]_ to speed
up AES encryption. CMAC with AES-NI might be a viable option but it's probably
too slow for daily operation. (testing required)
Conclusion
----------
SipHash provides the best combination of speed and security. Developers of
other prominent projects have come to the same conclusion.
C API additions
===============
None of the C API additions are part of the stable API.
hash secret
-----------
The ``_Py_HashSecret_t`` type of Python 2.6 to 3.3 has two members with either
32- or 64-bit length each. SipHash requires two 64-bit unsigned integers as keys.
The typedef will be changed to a union with a guaranteed size of 128 bits on
all architectures. On platforms with a 64-bit data type it will have two
``uint64`` members. Because C89 compatible compilers may not have ``uint64``
the union also has an array of 16 chars.

new type definition::

    typedef union {
        unsigned char uc16[16];
        struct {
            Py_hash_t prefix;
            Py_hash_t suffix;
        } ht;
    #ifdef PY_UINT64_T
        struct {
            PY_UINT64_T k0;
            PY_UINT64_T k1;
        } ui64;
    #endif
    } _Py_HashSecret_t;
    PyAPI_DATA(_Py_HashSecret_t) _Py_HashSecret;

The ``_Py_HashSecret`` variable is initialized in
``Python/random.c:_PyRandom_Init()`` exactly once at startup.
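The overlapping views of the 16-byte secret can be modelled with ``ctypes``. This is a rough sketch for illustration only: the class names are invented here, and ``Py_hash_t`` is approximated by ``ctypes.c_ssize_t``, which is an assumption of this example.

```python
import ctypes
import sys


class FNVSecret(ctypes.Structure):
    # Stand-in for the prefix/suffix pair; Py_hash_t modelled as c_ssize_t.
    _fields_ = [("prefix", ctypes.c_ssize_t), ("suffix", ctypes.c_ssize_t)]


class SipSecret(ctypes.Structure):
    # The two 64-bit SipHash key halves.
    _fields_ = [("k0", ctypes.c_uint64), ("k1", ctypes.c_uint64)]


class HashSecret(ctypes.Union):
    # All three members share the same 16 bytes of storage.
    _fields_ = [
        ("uc16", ctypes.c_ubyte * 16),
        ("ht", FNVSecret),
        ("ui64", SipSecret),
    ]


raw = bytes(range(16))  # fixed demo bytes; the real secret is random
secret = HashSecret.from_buffer_copy(raw)

print(ctypes.sizeof(HashSecret))  # size of the union in bytes
print(hex(secret.ui64.k0), hex(secret.ui64.k1))
```

Filling the byte array once is enough to seed every view, which is why the union works for both FNV (``ht``) and SipHash (``ui64``).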
hash function
-------------
function prototype::

    typedef Py_hash_t (*PyHash_Func_t)(const void *, Py_ssize_t);

hash function selection
-----------------------
type definition::

    #define PY_HASH_SIPHASH24 0x53495024
    #define PY_HASH_FNV 0x464E56

    #ifndef PY_HASH_ALGORITHM
    #if defined(PY_UINT64_T) && defined(PY_UINT32_T)
    #define PY_HASH_ALGORITHM PY_HASH_SIPHASH24
    #else
    #define PY_HASH_ALGORITHM PY_HASH_FNV
    #endif /* uint64_t && uint32_t */
    #endif /* PY_HASH_ALGORITHM */

    typedef struct {
        PyHash_Func_t hash;  /* function pointer */
        char *name;          /* name of the hash algorithm and variant */
        int hash_bits;       /* internal size of hash value */
        int seed_bits;       /* size of seed input */
    } PyHash_FuncDef;

    PyAPI_DATA(PyHash_FuncDef) PyHash_Func;

Implementation::

    #if PY_HASH_ALGORITHM == PY_HASH_FNV
    PyHash_FuncDef PyHash_Func = {fnv, "fnv", 8 * sizeof(Py_hash_t),
                                  16 * sizeof(Py_hash_t)};
    #endif

    #if PY_HASH_ALGORITHM == PY_HASH_SIPHASH24
    PyHash_FuncDef PyHash_Func = {siphash24, "siphash24", 64, 128};
    #endif

TODO: select hash algorithm with autoconf variable
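The preprocessor logic above can be paraphrased in a few lines of Python. This is only a sketch of the selection rule; ``select_algorithm`` and ``tag_to_text`` are hypothetical helpers, and the two boolean arguments stand in for the ``PY_UINT64_T`` and ``PY_UINT32_T`` feature macros.

```python
PY_HASH_SIPHASH24 = 0x53495024
PY_HASH_FNV = 0x464E56


def tag_to_text(tag):
    """Decode an algorithm tag into the ASCII text its bytes spell."""
    size = (tag.bit_length() + 7) // 8
    return tag.to_bytes(size, "big").decode("ascii")


def select_algorithm(have_uint64, have_uint32, override=None):
    """Mimic the #ifndef/#if cascade.

    An explicit choice (PY_HASH_ALGORITHM already defined) wins.
    Otherwise SipHash24 is used when both integer types exist.
    """
    if override is not None:
        return override
    if have_uint64 and have_uint32:
        return PY_HASH_SIPHASH24
    return PY_HASH_FNV
```

The numeric tags are the ASCII codes of short mnemonics, which ``tag_to_text`` makes visible when debugging a build configuration.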
Python API addition
===================
sys module
----------
The sys module already has a hash_info struct sequence. More fields are added
to the object to reflect the active hash algorithm and its properties.
::

    sys.hash_info(width=64,
                  modulus=2305843009213693951,
                  inf=314159,
                  nan=0,
                  imag=1000003,
                  # new fields:
                  algorithm='siphash24',
                  hash_bits=64,
                  seed_bits=128)

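Code that wants to report the active algorithm could read the new fields defensively. The sketch below assumes nothing beyond the field names listed above; ``describe_hash`` is a hypothetical helper, and it falls back to ``None`` on interpreters that lack the fields.

```python
import sys


def describe_hash():
    """Return (algorithm, hash_bits, seed_bits) from sys.hash_info.

    Each value is None when the interpreter does not provide the field.
    """
    info = sys.hash_info
    return tuple(getattr(info, name, None)
                 for name in ("algorithm", "hash_bits", "seed_bits"))


algorithm, hash_bits, seed_bits = describe_hash()
print(algorithm, hash_bits, seed_bits)
```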
Necessary modifications to C code
=================================
_Py_HashBytes (Objects/object.c)
--------------------------------
``_Py_HashBytes`` is an internal helper function that provides the hashing
code for bytes, memoryview and datetime classes. It currently implements FNV
for ``unsigned char*``. The function can either be modified to use the new
API or it could be completely removed to avoid an unnecessary level of
indirection.
bytes_hash (Objects/bytesobject.c)
----------------------------------
``bytes_hash`` uses ``_Py_HashBytes`` to provide the tp_hash slot function
for bytes objects. If ``_Py_HashBytes`` is to be removed then ``bytes_hash``
must be reimplemented.
memory_hash (Objects/memoryobject.c)
------------------------------------
``memory_hash`` provides the tp_hash slot function for read-only memory
views if the original object is hashable, too. It's the only function that
has to support hashing of unaligned memory segments in the future.
unicode_hash (Objects/unicodeobject.c)
--------------------------------------
``unicode_hash`` provides the tp_hash slot function for unicode. Right now it
implements the FNV algorithm three times for ``unsigned char*``, ``Py_UCS2``
and ``Py_UCS4``. A reimplementation of the function must take care to use the
correct length. Since the macro ``PyUnicode_GET_LENGTH`` returns the length
of the unicode string and not its size in octets, the length must be
multiplied by the size of the internal unicode kind::

    if (PyUnicode_READY(u) == -1)
        return -1;
    x = PyHash_Func.hash(PyUnicode_DATA(u),
                         PyUnicode_GET_LENGTH(u) * PyUnicode_KIND(u));

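The length-times-kind computation can be illustrated from Python. The sketch below derives the storage kind from the highest code point, following the PEP 393 rules (1 byte up to U+00FF, 2 bytes up to U+FFFF, 4 bytes otherwise); ``kind_and_size`` is a hypothetical helper, not an existing API.

```python
def kind_and_size(text):
    """Return (kind, size_in_bytes) as PEP 393 would store ``text``.

    kind is the number of bytes per code point: 1, 2 or 4.
    """
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        kind = 1
    elif highest < 0x10000:
        kind = 2
    else:
        kind = 4
    # The hash function must be given the size in bytes, not the length.
    return kind, len(text) * kind
```

For example, a two-character string containing the euro sign occupies four bytes, so passing only its length would hash half of the buffer.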
generic_hash (Modules/_datetimemodule.c)
----------------------------------------
``generic_hash`` acts as a wrapper around ``_Py_HashBytes`` for the tp_hash
slots of date, time and datetime types. timedelta objects are hashed by their
state (days, seconds, microseconds) and tzinfo objects are not hashable. The
data members of date, time and datetime types' struct are not ``void*`` aligned.
This can easily be fixed by memcpy()ing four to ten bytes to an aligned
buffer.
Further things to consider
==========================
ASCII str / bytes hash collision
--------------------------------
Since the implementation of [pep-0393]_, bytes and ASCII text have the same
memory layout. Because of this the new hashing API will keep the invariant::

    hash("ascii string") == hash(b"ascii string")

for ASCII string and ASCII bytes. Equal hash values result in a hash collision
and therefore cause a minor speed penalty for dicts and sets with mixed keys.
The cause of the collision could be removed by e.g. subtracting ``2`` from
the hash value of bytes. (``-2`` because ``hash(b"") == 0`` and ``-1`` is
reserved.)
Performance
===========
TBD
First tests suggest that SipHash performs a bit faster on 64-bit CPUs when
it is fed with medium size byte strings as well as ASCII and UCS2 Unicode
strings. For very short strings the setup cost for SipHash dominates its
speed but it is still in the same order of magnitude as the current FNV code.
It is not yet known how the new distribution of hash values affects collisions
of common keys in dicts of Python classes.
Typical length
--------------
Serhiy Storchaka has shown in [issue16427]_ that a modified FNV
implementation with 64 bits per cycle is able to process long strings several
times faster than the current FNV implementation.
However, according to statistics [issue19183]_, about 50% of the strings
hashed by a typical Python program and by the Python test suite are small
strings between 1 and 6 bytes. Only 5% of the strings are larger than 16 bytes.
Grand Unified Python Benchmark Suite
------------------------------------
Initial tests with an experimental implementation and the Grand Unified Python
Benchmark Suite have shown minimal deviations. The summarized total runtime
of the benchmark is within 1% of the runtime of an unmodified Python 3.4
binary. The tests were run on an Intel i7-2860QM machine with a 64-bit Linux
installation. The interpreter was compiled with GCC 4.7 for 64- and 32-bit.
More benchmarks will be conducted.
Backwards Compatibility
=======================
The modifications don't alter any existing API.
The output of ``hash()`` for strings and bytes is going to be different. The
hash values for ASCII Unicode and ASCII bytes will stay equal.
Alternative counter measures against hash collision DoS
=======================================================
Three alternative countermeasures against hash collisions were discussed in
the past, but are not the subject of this PEP.

1. Marc-Andre Lemburg has suggested that dicts shall count hash collisions. In
   case an insert operation causes too many collisions an exception shall be
   raised.

2. Some applications (e.g. PHP) limit the number of keys for GET and POST
   HTTP requests. The approach effectively limits the impact of a hash
   collision attack. (XXX citation needed)

3. Hash maps have a worst case of O(n) for insertion and lookup of keys. This
   results in a quadratic runtime during a hash collision attack. The
   introduction of a new and additional data structure with O(log n)
   worst case behavior would eliminate the root cause. Data structures like
   red-black trees or prefix trees (trie [trie]_) would have other benefits,
   too. Prefix trees with string keys can reduce memory usage as common
   prefixes are stored within the tree structure.

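The first countermeasure in the list above, counting collisions and raising an exception, can be sketched with a toy chained hash table. This is an illustration of the idea only and says nothing about how CPython's dict is implemented; every name in it is hypothetical.

```python
class TooManyCollisions(Exception):
    """Raised when one bucket would hold more keys than allowed."""


class CountingTable:
    """Toy chained hash table that limits the length of each bucket chain."""

    def __init__(self, buckets=8, max_chain=4):
        self._buckets = [[] for _ in range(buckets)]
        self._max_chain = max_chain

    def insert(self, key, value):
        chain = self._buckets[hash(key) % len(self._buckets)]
        for i, (existing, _) in enumerate(chain):
            if existing == key:
                chain[i] = (key, value)  # overwrite, no new collision
                return
        if len(chain) >= self._max_chain:
            raise TooManyCollisions(key)
        chain.append((key, value))


class Colliding:
    """Key type whose instances all hash to the same value."""

    def __init__(self, n):
        self.n = n

    def __hash__(self):
        return 0

    def __eq__(self, other):
        return isinstance(other, Colliding) and self.n == other.n
```

Inserting more than ``max_chain`` distinct ``Colliding`` keys raises ``TooManyCollisions``, which bounds the work an attacker can force per insert at the price of rejecting some legitimate inputs.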
Discussion
==========
Pluggable
---------
The first draft of this PEP made the hash algorithm pluggable at runtime. It
supported multiple hash algorithms in one binary to give the user the
possibility to select a hash algorithm at startup. The approach was considered
an unnecessary complication by several core committers [pluggable]_. Subsequent
versions of the PEP aim for compile time configuration.
References
==========
* Issue 19183 [issue19183]_ contains a reference implementation.
.. [29c3] http://events.ccc.de/congress/2012/Fahrplan/events/5152.en.html
.. [fnv] http://en.wikipedia.org/wiki/Fowler-Noll-Vo_hash_function
.. [sip] https://131002.net/siphash/
.. [ocert] http://www.nruns.com/_downloads/advisory28122011.pdf
.. [ocert-2012-001] http://www.ocert.org/advisories/ocert-2012-001.html
.. [poc] https://131002.net/siphash/poc.py
.. [issue13703] http://bugs.python.org/issue13703
.. [issue14621] http://bugs.python.org/issue14621
.. [issue16427] http://bugs.python.org/issue16427
.. [issue19183] http://bugs.python.org/issue19183
.. [trie] http://en.wikipedia.org/wiki/Trie
.. [city] http://code.google.com/p/cityhash/
.. [murmur] http://code.google.com/p/smhasher/
.. [csiphash] https://github.com/majek/csiphash/
.. [pep-0393] http://www.python.org/dev/peps/pep-0393/
.. [aes-ni] http://en.wikipedia.org/wiki/AES_instruction_set
.. [pluggable] https://mail.python.org/pipermail/python-dev/2013-October/129138.html
Copyright
=========
This document has been placed in the public domain.
..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End: