Add results from first tests with Python benchmark suite
document requirement of 64bit data type explain hash algorithm requirements with RFC 2119 keywords
This commit is contained in:
parent
24c5df69c6
commit
afc47ea51a
93
pep-0456.txt
93
pep-0456.txt
|
@ -39,7 +39,7 @@ implementation without patching and recompiling large parts of the interpreter.
|
|||
Embedders may want to choose a more suitable hash function.
|
||||
|
||||
Finally the current implementation code does not perform well. In the common
|
||||
case it only processes one or two bytes per cycle. On a modern 64bit processor
|
||||
case it only processes one or two bytes per cycle. On a modern 64-bit processor
|
||||
the code can easily be adjusted to deal with eight bytes at once.
|
||||
|
||||
This PEP proposes three major changes to the hash code for strings and bytes:
|
||||
|
@ -49,14 +49,42 @@ This PEP proposes three major changes to the hash code for strings and bytes:
|
|||
by well known security and crypto experts, it is safe to assume that its
|
||||
secure for the near future.
|
||||
|
||||
* The existing FNV code is kept for platforms without a 64-bit data type. The
|
||||
algorithm is optimized to process larger chunks per cycle.
|
||||
|
||||
* Calculation of the hash of strings and bytes is moved into a single API
|
||||
function instead of multiple specialized implementations in
|
||||
``Objects/object.c`` and ``Objects/unicodeobject.c``. The function takes a
|
||||
void pointer plus length and returns the hash for it.
|
||||
|
||||
* The algorithm can be selected by the user with an environment variable,
|
||||
command line argument or by an embedder with an API function. By default FNV
|
||||
and SipHash are available for selection.
|
||||
command line argument or with an API function (for embedders). FNV is
|
||||
guaranteed to exist on all platforms. SipHash is available on the majority
|
||||
of modern systems.
|
||||
|
||||
|
||||
Requirements for a hash function
|
||||
================================
|
||||
|
||||
* It MUST be able to hash arbitrarily large blocks of memory from 1 byte up
|
||||
to the maximum ``ssize_t`` value.
|
||||
|
||||
* It MUST produce at least 32 bits on 32-bit platforms and at least 64 bits
|
||||
on 64-bit platforms. (Note: Larger outputs can be compressed with e.g.
|
||||
``v ^ (v >> 32)``.)
|
||||
|
||||
* It MUST support hashing of unaligned memory in order to support
|
||||
hash(memoryview).
|
||||
|
||||
* It MUST NOT return ``-1``. The value is reserved for error cases and yet
|
||||
uncached hash values. (Note: A special case can be added to map ``-1``
|
||||
to ``-2``.)
|
||||
|
||||
* It is highly RECOMMENDED that the length of the input influences the
|
||||
outcome, so that ``hash(b'\00') != hash(b'\x00\x00')``.
|
||||
|
||||
* It MAY return ``0`` for zero length input in order to disguise the
|
||||
randomization seed. (Note: This can be handled as special case, too.)
|
||||
|
||||
|
||||
Current implementation with modified FNV
|
||||
|
@ -125,28 +153,6 @@ this PEP strongly believes that the nature of a non-cryptographic hash
|
|||
function makes it impossible to conceal the secrets.
|
||||
|
||||
|
||||
Requirements for a hash function
|
||||
================================
|
||||
|
||||
|
||||
|
||||
* It must be able to hash arbitrarily large blocks of memory from 1 bytes up
|
||||
to the maximum ``ssize_t`` value.
|
||||
|
||||
* It must produce at least 32bit values on 32bit platforms and at least 64bit
|
||||
values on 64bit platforms. (Note: Larger outputs can be compressed with e.g.
|
||||
``v ^ (v >> 32)``.)
|
||||
|
||||
* It must support hashing of unaligned memory in order to support
|
||||
hash(memoryview).
|
||||
|
||||
* It must not return ``-1``. It` either stands for error or missing hash value.
|
||||
(Note: A special case can be added to map ``-1`` to ``-2``.)
|
||||
|
||||
* It should return ``0`` for zero length input. (Note: This can be handled as
|
||||
special case, too.)
|
||||
|
||||
|
||||
Examined hashing algorithms
|
||||
===========================
|
||||
|
||||
|
@ -157,7 +163,7 @@ SipHash
|
|||
-------
|
||||
|
||||
SipHash [sip]_ is a cryptographic pseudo random function with a 128bit seed and
|
||||
64bit output. It was designed by Jean-Philippe Aumasson and Daniel J.
|
||||
64-bit output. It was designed by Jean-Philippe Aumasson and Daniel J.
|
||||
Bernstein as a fast and secure keyed hash algorithm. It's used by Ruby, Perl,
|
||||
OpenDNS, Rust, Redis, FreeBSD and more. The C reference implementation has
|
||||
been released under CC0 license (public domain).
|
||||
|
@ -181,6 +187,9 @@ uint64_t)::
|
|||
unsigned long src_sz,
|
||||
const char k[16]);
|
||||
|
||||
SipHash requires a 64-bit data type and is not compatible with pure C89
|
||||
platforms.
|
||||
|
||||
|
||||
MurmurHash
|
||||
----------
|
||||
|
@ -188,7 +197,7 @@ MurmurHash
|
|||
MurmurHash [murmur]_ is a family of non-cryptographic keyed hash function
|
||||
developed by Austin Appleby. Murmur3 is the latest and fast variant of
|
||||
MurmurHash. The C++ reference implementation has been released into public
|
||||
domain. It features 32 or 128bit output with a 32bit seed. (Note: The out
|
||||
domain. It features 32 or 128bit output with a 32-bit seed. (Note: The out
|
||||
parameter is a buffer with either 1 or 4 bytes.)
|
||||
|
||||
Murmur3's function prototypes are::
|
||||
|
@ -208,6 +217,9 @@ Murmur3's function prototypes are::
|
|||
uint32_t seed,
|
||||
void *out);
|
||||
|
||||
The 128bit variants requires a 64-bit data type and are not compatible with
|
||||
pure C89 platforms. The 32-bit variant is fully C89-compatible.
|
||||
|
||||
Aumasson, Bernstein and Boßlet have shown [sip]_ [ocert-2012-001]_ that
|
||||
Murmur3 is not resilient against hash collision attacks. Therefore Murmur3
|
||||
can no longer be considered as secure algorithm. It still may be an
|
||||
|
@ -220,15 +232,18 @@ CityHash [city]_ is a family of non-cryptographic hash function developed by
|
|||
Geoff Pike and Jyrki Alakuijala for Google. The C++ reference implementation
|
||||
has been released under MIT license. The algorithm is partly based on
|
||||
MurmurHash and claims to be faster. It supports 64 and 128 bit output with a
|
||||
128bit seed as well as 32bit output without seed.
|
||||
128bit seed as well as 32-bit output without seed.
|
||||
|
||||
::
|
||||
The relevant function prototype for 64-bit CityHash with 128bit seed is::
|
||||
|
||||
uint64 CityHash64WithSeeds(const char *buf,
|
||||
size_t len,
|
||||
uint64 seed0,
|
||||
uint64 seed1)
|
||||
|
||||
CityHash also offers SSE 4.2 optimizations with CRC32 intrinsic for long
|
||||
inputs. All variants except CityHash32 require 64-bit data types. CityHash32
|
||||
uses only 32-bit data types but it doesn't support seeding.
|
||||
|
||||
Like MurmurHash Aumasson, Bernstein and Boßlet have shown [sip]_ a similar
|
||||
weakness in CityHash.
|
||||
|
@ -265,9 +280,9 @@ hash secret
|
|||
-----------
|
||||
|
||||
The ``_Py_HashSecret_t`` type of Python 2.6 to 3.3 has two members with either
|
||||
32 or 64bit length each. SipHash requires two 64bit unsigned integers as keys.
|
||||
32 or 64-bit length each. SipHash requires two 64-bit unsigned integers as keys.
|
||||
The typedef will be changed to an union with a guaranteed size of 128bits on
|
||||
all architectures. On platforms with a 64bit data type it will have two
|
||||
all architectures. On platforms with a 64-bit data type it will have two
|
||||
``uint64`` members. Because C89 compatible compilers may not have ``uint64``
|
||||
the union also has an array of 16 chars.
|
||||
|
||||
|
@ -473,7 +488,7 @@ Performance
|
|||
|
||||
TBD
|
||||
|
||||
First tests suggest that SipHash performs a bit faster on 64bit CPUs when
|
||||
First tests suggest that SipHash performs a bit faster on 64-bit CPUs when
|
||||
it is feed with medium size byte strings as well as ASCII and UCS2 Unicode
|
||||
strings. For very short strings the setup costs for SipHash dominates its
|
||||
speed but it is still in the same order of magnitude as the current FNV code.
|
||||
|
@ -482,10 +497,22 @@ It's yet unknown how the new distribution of hash values affects collisions
|
|||
of common keys in dicts of Python classes.
|
||||
|
||||
Serhiy Storchaka has shown in [issue16427]_ that a modified FNV
|
||||
implementation with 64bits per cycle is able to process long strings several
|
||||
implementation with 64-bits per cycle is able to process long strings several
|
||||
times faster than the current FNV implementation.
|
||||
|
||||
|
||||
Grand Unified Python Benchmark Suite
|
||||
------------------------------------
|
||||
|
||||
Initial tests with an experimental implementation and the Grand Unified Python
|
||||
Benchmark Suite have shown minimal deviations. The summarized total runtime
|
||||
of the benchmark is within 1% of the runtime of an unmodified Python 3.4
|
||||
binary. The tests were run on an Intel i7-2860QM machine with a 64-bit Linux
|
||||
installation. The interpreter was compiled with GCC 4.7 for 64 and 32-bit.
|
||||
|
||||
More benchmarks will be conducted.
|
||||
|
||||
|
||||
Backwards Compatibility
|
||||
=======================
|
||||
|
||||
|
|
Loading…
Reference in New Issue