Add results from first tests with Python benchmark suite

document requirement of 64bit data type
explain hash algorithm requirements with RFC 2119 keywords
This commit is contained in:
Christian Heimes 2013-10-03 12:52:00 +02:00
parent 24c5df69c6
commit afc47ea51a
1 changed files with 60 additions and 33 deletions

View File

@ -39,7 +39,7 @@ implementation without patching and recompiling large parts of the interpreter.
Embedders may want to choose a more suitable hash function.
Finally the current implementation code does not perform well. In the common
case it only processes one or two bytes per cycle. On a modern 64bit processor
case it only processes one or two bytes per cycle. On a modern 64-bit processor
the code can easily be adjusted to deal with eight bytes at once.
This PEP proposes three major changes to the hash code for strings and bytes:
@ -49,14 +49,42 @@ This PEP proposes three major changes to the hash code for strings and bytes:
by well known security and crypto experts, it is safe to assume that its
secure for the near future.
* The existing FNV code is kept for platforms without a 64-bit data type. The
algorithm is optimized to process larger chunks per cycle.
* Calculation of the hash of strings and bytes is moved into a single API
function instead of multiple specialized implementations in
``Objects/object.c`` and ``Objects/unicodeobject.c``. The function takes a
void pointer plus length and returns the hash for it.
* The algorithm can be selected by the user with an environment variable,
command line argument or by an embedder with an API function. By default FNV
and SipHash are available for selection.
command line argument or with an API function (for embedders). FNV is
guaranteed to exist on all platforms. SipHash is available on the majority
of modern systems.
Requirements for a hash function
================================
* It MUST be able to hash arbitrarily large blocks of memory from 1 byte up
to the maximum ``ssize_t`` value.
* It MUST produce at least 32 bits on 32-bit platforms and at least 64 bits
on 64-bit platforms. (Note: Larger outputs can be compressed with e.g.
``v ^ (v >> 32)``.)
* It MUST support hashing of unaligned memory in order to support
hash(memoryview).
* It MUST NOT return ``-1``. The value is reserved for error cases and yet
uncached hash values. (Note: A special case can be added to map ``-1``
to ``-2``.)
* It is highly RECOMMENDED that the length of the input influences the
outcome, so that ``hash(b'\00') != hash(b'\x00\x00')``.
* It MAY return ``0`` for zero length input in order to disguise the
randomization seed. (Note: This can be handled as special case, too.)
Current implementation with modified FNV
@ -125,28 +153,6 @@ this PEP strongly believes that the nature of a non-cryptographic hash
function makes it impossible to conceal the secrets.
Requirements for a hash function
================================
* It must be able to hash arbitrarily large blocks of memory from 1 bytes up
to the maximum ``ssize_t`` value.
* It must produce at least 32bit values on 32bit platforms and at least 64bit
values on 64bit platforms. (Note: Larger outputs can be compressed with e.g.
``v ^ (v >> 32)``.)
* It must support hashing of unaligned memory in order to support
hash(memoryview).
* It must not return ``-1``. It` either stands for error or missing hash value.
(Note: A special case can be added to map ``-1`` to ``-2``.)
* It should return ``0`` for zero length input. (Note: This can be handled as
special case, too.)
Examined hashing algorithms
===========================
@ -157,7 +163,7 @@ SipHash
-------
SipHash [sip]_ is a cryptographic pseudo random function with a 128bit seed and
64bit output. It was designed by Jean-Philippe Aumasson and Daniel J.
64-bit output. It was designed by Jean-Philippe Aumasson and Daniel J.
Bernstein as a fast and secure keyed hash algorithm. It's used by Ruby, Perl,
OpenDNS, Rust, Redis, FreeBSD and more. The C reference implementation has
been released under CC0 license (public domain).
@ -181,6 +187,9 @@ uint64_t)::
unsigned long src_sz,
const char k[16]);
SipHash requires a 64-bit data type and is not compatible with pure C89
platforms.
MurmurHash
----------
@ -188,7 +197,7 @@ MurmurHash
MurmurHash [murmur]_ is a family of non-cryptographic keyed hash function
developed by Austin Appleby. Murmur3 is the latest and fast variant of
MurmurHash. The C++ reference implementation has been released into public
domain. It features 32 or 128bit output with a 32bit seed. (Note: The out
domain. It features 32 or 128bit output with a 32-bit seed. (Note: The out
parameter is a buffer with either 1 or 4 bytes.)
Murmur3's function prototypes are::
@ -208,6 +217,9 @@ Murmur3's function prototypes are::
uint32_t seed,
void *out);
The 128bit variants requires a 64-bit data type and are not compatible with
pure C89 platforms. The 32-bit variant is fully C89-compatible.
Aumasson, Bernstein and Boßlet have shown [sip]_ [ocert-2012-001]_ that
Murmur3 is not resilient against hash collision attacks. Therefore Murmur3
can no longer be considered as secure algorithm. It still may be an
@ -220,15 +232,18 @@ CityHash [city]_ is a family of non-cryptographic hash function developed by
Geoff Pike and Jyrki Alakuijala for Google. The C++ reference implementation
has been released under MIT license. The algorithm is partly based on
MurmurHash and claims to be faster. It supports 64 and 128 bit output with a
128bit seed as well as 32bit output without seed.
128bit seed as well as 32-bit output without seed.
::
The relevant function prototype for 64-bit CityHash with 128bit seed is::
uint64 CityHash64WithSeeds(const char *buf,
size_t len,
uint64 seed0,
uint64 seed1)
CityHash also offers SSE 4.2 optimizations with CRC32 intrinsic for long
inputs. All variants except CityHash32 require 64-bit data types. CityHash32
uses only 32-bit data types but it doesn't support seeding.
Like MurmurHash Aumasson, Bernstein and Boßlet have shown [sip]_ a similar
weakness in CityHash.
@ -265,9 +280,9 @@ hash secret
-----------
The ``_Py_HashSecret_t`` type of Python 2.6 to 3.3 has two members with either
32 or 64bit length each. SipHash requires two 64bit unsigned integers as keys.
32 or 64-bit length each. SipHash requires two 64-bit unsigned integers as keys.
The typedef will be changed to an union with a guaranteed size of 128bits on
all architectures. On platforms with a 64bit data type it will have two
all architectures. On platforms with a 64-bit data type it will have two
``uint64`` members. Because C89 compatible compilers may not have ``uint64``
the union also has an array of 16 chars.
@ -473,7 +488,7 @@ Performance
TBD
First tests suggest that SipHash performs a bit faster on 64bit CPUs when
First tests suggest that SipHash performs a bit faster on 64-bit CPUs when
it is feed with medium size byte strings as well as ASCII and UCS2 Unicode
strings. For very short strings the setup costs for SipHash dominates its
speed but it is still in the same order of magnitude as the current FNV code.
@ -482,10 +497,22 @@ It's yet unknown how the new distribution of hash values affects collisions
of common keys in dicts of Python classes.
Serhiy Storchaka has shown in [issue16427]_ that a modified FNV
implementation with 64bits per cycle is able to process long strings several
implementation with 64-bits per cycle is able to process long strings several
times faster than the current FNV implementation.
Grand Unified Python Benchmark Suite
------------------------------------
Initial tests with an experimental implementation and the Grand Unified Python
Benchmark Suite have shown minimal deviations. The summarized total runtime
of the benchmark is within 1% of the runtime of an unmodified Python 3.4
binary. The tests were run on an Intel i7-2860QM machine with a 64-bit Linux
installation. The interpreter was compiled with GCC 4.7 for 64 and 32-bit.
More benchmarks will be conducted.
Backwards Compatibility
=======================