Update PEP 456 to reflect the changes in features/pep-456
parent
abfd6e8439
commit
1db087f304
135
pep-0456.txt
|
@@ -75,15 +75,13 @@ Requirements for a hash function
|
||||||
* It MUST support hashing of unaligned memory in order to support
|
* It MUST support hashing of unaligned memory in order to support
|
||||||
hash(memoryview).
|
hash(memoryview).
|
||||||
|
|
||||||
* It MUST NOT return ``-1``. The value is reserved for error cases and yet
|
|
||||||
uncached hash values. (Note: A special case can be added to map ``-1``
|
|
||||||
to ``-2``.)
|
|
||||||
|
|
||||||
* It is highly RECOMMENDED that the length of the input influences the
|
* It is highly RECOMMENDED that the length of the input influences the
|
||||||
outcome, so that ``hash(b'\00') != hash(b'\x00\x00')``.
|
outcome, so that ``hash(b'\00') != hash(b'\x00\x00')``.
|
||||||
|
|
||||||
* It MAY return ``0`` for zero length input in order to disguise the
|
The internal interface code between the hash function and the tp_hash slots
|
||||||
randomization seed. (Note: This can be handled as special case, too.)
|
implements special cases for zero length input and a return value of ``-1``.
|
||||||
|
An input of length ``0`` is mapped to hash value ``0``. The output ``-1``
|
||||||
|
is mapped to ``-2``.
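
The two special cases described above can be sketched as a thin wrapper around the raw hash function. This is only an illustration, not the code from the branch: ``sketch_hash_t``, ``sketch_ssize_t``, ``always_minus_one`` and ``hash_with_special_cases`` are hypothetical names standing in for ``Py_hash_t``, ``Py_ssize_t`` and the real interface code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: stand-ins for Py_hash_t and Py_ssize_t. */
typedef ptrdiff_t sketch_hash_t;
typedef ptrdiff_t sketch_ssize_t;

typedef sketch_hash_t (*sketch_raw_hash_fn)(const void *, sketch_ssize_t);

/* Hypothetical raw hash that always yields -1, used only to
   exercise the -1 -> -2 remapping. */
static sketch_hash_t
always_minus_one(const void *src, sketch_ssize_t len)
{
    (void)src;
    (void)len;
    return -1;
}

/* Hypothetical interface code between the raw hash and a tp_hash slot. */
static sketch_hash_t
hash_with_special_cases(sketch_raw_hash_fn raw, const void *src,
                        sketch_ssize_t len)
{
    sketch_hash_t x;

    assert(len >= 0);
    if (len == 0)
        return 0;       /* zero length input is mapped to 0 */
    x = raw(src, len);
    if (x == -1)
        return -2;      /* -1 is reserved, so it is mapped to -2 */
    return x;
}
```

With this sketch, ``hash_with_special_cases(always_minus_one, "a", 1)`` yields ``-2`` and any call with ``len == 0`` yields ``0`` without invoking the raw hash at all.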
|
||||||
|
|
||||||
|
|
||||||
Current implementation with modified FNV
|
Current implementation with modified FNV
|
||||||
|
@@ -306,52 +304,63 @@ new type definition::
|
||||||
``_Py_HashSecret_t`` is initialized in ``Python/random.c:_PyRandom_Init()``
|
``_Py_HashSecret_t`` is initialized in ``Python/random.c:_PyRandom_Init()``
|
||||||
exactly once at startup.
|
exactly once at startup.
|
||||||
|
|
||||||
hash function
|
|
||||||
-------------
|
|
||||||
|
|
||||||
function prototype::
|
hash function definition
|
||||||
|
------------------------
|
||||||
|
|
||||||
typedef Py_hash_t (*PyHash_Func_t)(const void *, Py_ssize_t);
|
Implementation::
|
||||||
|
|
||||||
|
typedef struct {
|
||||||
|
/* function pointer to hash function, e.g. fnv or siphash24 */
|
||||||
|
Py_hash_t (*const hash)(const void *, Py_ssize_t);
|
||||||
|
const char *name; /* name of the hash algorithm and variant */
|
||||||
|
const int hash_bits; /* internal size of hash value */
|
||||||
|
const int seed_bits; /* size of seed input */
|
||||||
|
} PyHash_FuncDef;
|
||||||
|
|
||||||
|
PyAPI_FUNC(PyHash_FuncDef*) PyHash_GetFuncDef(void);
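
As a rough illustration of how a caller might use such a definition, the following self-contained sketch mirrors the struct layout above with hypothetical names (``Sketch_FuncDef``, ``sketch_get_funcdef``, ``toy_hash``). The toy function is not FNV or SipHash24; it only makes the example compile and run outside of CPython.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: stand-ins for Py_hash_t and Py_ssize_t. */
typedef ptrdiff_t sketch_hash_t;
typedef ptrdiff_t sketch_ssize_t;

/* Same shape as the struct in the text, under a hypothetical name. */
typedef struct {
    sketch_hash_t (*const hash)(const void *, sketch_ssize_t);
    const char *name;       /* name of the hash algorithm and variant */
    const int hash_bits;    /* internal size of hash value */
    const int seed_bits;    /* size of seed input */
} Sketch_FuncDef;

/* Toy polynomial hash over the bytes; NOT a real hash algorithm. */
static sketch_hash_t
toy_hash(const void *src, sketch_ssize_t len)
{
    const unsigned char *p = src;
    size_t x = 0;
    sketch_ssize_t i;

    assert(len >= 0);
    for (i = 0; i < len; i++)
        x = x * 31u + p[i];
    return (sketch_hash_t)x;
}

static Sketch_FuncDef sketch_funcdef = {
    toy_hash, "toy", 8 * (int)sizeof(sketch_hash_t), 0
};

/* Hypothetical accessor, analogous in spirit to PyHash_GetFuncDef(). */
static Sketch_FuncDef *
sketch_get_funcdef(void)
{
    return &sketch_funcdef;
}
```

A caller would fetch the definition once and go through the function pointer, for example ``sketch_get_funcdef()->hash(buf, len)``, while ``name``, ``hash_bits`` and ``seed_bits`` describe the active algorithm.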
|
||||||
|
|
||||||
|
|
||||||
|
autoconf
|
||||||
|
--------
|
||||||
|
|
||||||
|
A new test is added to the configure script. The test sets
|
||||||
|
``HAVE_ALIGNED_REQUIRED`` when it detects a platform that requires aligned
|
||||||
|
memory access for integers. Most current platforms such as X86, X86_64 and
|
||||||
|
modern ARM don't need aligned data.
|
||||||
|
|
||||||
|
A new option ``--with-hash-algorithm`` enables the user to select a hash
|
||||||
|
algorithm in the configure step.
|
||||||
|
|
||||||
|
|
||||||
hash function selection
|
hash function selection
|
||||||
-----------------------
|
-----------------------
|
||||||
|
|
||||||
type definition::
|
The value of the macro ``PY_HASH_ALGORITHM`` defines which hash algorithm is
|
||||||
|
used internally. It may be set to any of the three values ``PY_HASH_SIPHASH24``,
|
||||||
|
``PY_HASH_FNV`` or ``PY_HASH_EXTERNAL``. If ``PY_HASH_ALGORITHM`` is not
|
||||||
|
defined at all, then the best available algorithm is selected. On platforms
|
||||||
|
which don't require aligned memory access (``HAVE_ALIGNED_REQUIRED`` not
|
||||||
|
defined) and have an unsigned 64-bit integer type ``PY_UINT64_T``, SipHash24 is
|
||||||
|
used. On strict C89 platforms without a 64-bit data type, or architectures such
|
||||||
|
as SPARC, FNV is selected as fallback. A hash algorithm can be selected with
|
||||||
|
an autoconf option, for example ``./configure --with-hash-algorithm=fnv``.
|
||||||
|
|
||||||
#define PY_HASH_SIPHASH24 0x53495024
|
The value ``PY_HASH_EXTERNAL`` allows 3rd parties to provide their own
|
||||||
#define PY_HASH_FNV 0x464E56
|
implementation at compile time.
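
The selection rule described above could be expressed with the preprocessor roughly as follows. This is a sketch under stated assumptions: the ``SKETCH_*`` macros and their numeric values are hypothetical, and ``UINT64_MAX`` from ``<stdint.h>`` stands in for the ``PY_UINT64_T`` check. Only ``HAVE_ALIGNED_REQUIRED`` is taken from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tags; the real values are not given in the text. */
#define SKETCH_HASH_EXTERNAL  0
#define SKETCH_HASH_SIPHASH24 1
#define SKETCH_HASH_FNV       2

/* If the user did not choose an algorithm, pick the best available one:
   SipHash24 when unaligned access is fine and a 64-bit unsigned type
   exists, FNV otherwise. */
#ifndef SKETCH_HASH_ALGORITHM
#  if !defined(HAVE_ALIGNED_REQUIRED) && defined(UINT64_MAX)
#    define SKETCH_HASH_ALGORITHM SKETCH_HASH_SIPHASH24
#  else
#    define SKETCH_HASH_ALGORITHM SKETCH_HASH_FNV
#  endif
#endif

/* Map a tag to a printable algorithm name. */
static const char *
sketch_algorithm_name(int tag)
{
    assert(tag == SKETCH_HASH_EXTERNAL ||
           tag == SKETCH_HASH_SIPHASH24 ||
           tag == SKETCH_HASH_FNV);
    switch (tag) {
    case SKETCH_HASH_SIPHASH24:
        return "siphash24";
    case SKETCH_HASH_FNV:
        return "fnv";
    default:
        return "external";
    }
}
```

Compiling with ``-DSKETCH_HASH_ALGORITHM=SKETCH_HASH_FNV`` would override the automatic choice, which is analogous to what the configure option does for the real macro.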
|
||||||
|
|
||||||
#ifndef PY_HASH_ALGORITHM
|
|
||||||
#if defined(PY_UINT64_T) && defined(PY_UINT32_T)
|
|
||||||
#define PY_HASH_ALGORITHM PY_HASH_SIPHASH24
|
|
||||||
#else
|
|
||||||
#define PY_HASH_ALGORITHM PY_HASH_FNV
|
|
||||||
#endif /* uint64_t && uint32_t */
|
|
||||||
#endif /* PY_HASH_ALGORITHM */
|
|
||||||
|
|
||||||
typedef struct {
|
|
||||||
PyHash_Func_t hash; /* function pointer */
|
|
||||||
char *name; /* name of the hash algorithm and variant */
|
|
||||||
int hash_bits; /* internal size of hash value */
|
|
||||||
int seed_bits; /* size of seed input */
|
|
||||||
} PyHash_FuncDef;
|
|
||||||
|
|
||||||
PyAPI_DATA(PyHash_FuncDef) PyHash_Func;
|
|
||||||
|
|
||||||
Implementation::
|
Implementation::
|
||||||
|
|
||||||
#if PY_HASH_ALGORITHM == PY_HASH_FNV
|
#if PY_HASH_ALGORITHM == PY_HASH_EXTERNAL
|
||||||
PyHash_FuncDef PyHash_Func = {fnv, "fnv", 8 * sizeof(Py_hash_t),
|
extern PyHash_FuncDef PyHash_Func;
|
||||||
|
#elif PY_HASH_ALGORITHM == PY_HASH_SIPHASH24
|
||||||
|
static PyHash_FuncDef PyHash_Func = {siphash24, "siphash24", 64, 128};
|
||||||
|
#elif PY_HASH_ALGORITHM == PY_HASH_FNV
|
||||||
|
static PyHash_FuncDef PyHash_Func = {fnv, "fnv", 8 * sizeof(Py_hash_t),
|
||||||
16 * sizeof(Py_hash_t)};
|
16 * sizeof(Py_hash_t)};
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
#if PY_HASH_ALGORITHM == PY_HASH_SIPHASH24
|
|
||||||
PyHash_FuncDef PyHash_Func = {siphash24, "siphash24", 64, 128};
|
|
||||||
#endif
|
|
||||||
|
|
||||||
TODO: select hash algorithm with autoconf variable
|
|
||||||
|
|
||||||
|
|
||||||
Python API addition
|
Python API addition
|
||||||
===================
|
===================
|
||||||
|
@@ -378,34 +387,37 @@ to the object to reflect the active hash algorithm and its properties.
|
||||||
Necessary modifications to C code
|
Necessary modifications to C code
|
||||||
=================================
|
=================================
|
||||||
|
|
||||||
_Py_HashBytes (Objects/object.c)
|
_Py_HashBytes() (Objects/object.c)
|
||||||
--------------------------------
|
----------------------------------
|
||||||
|
|
||||||
``_Py_HashBytes`` is an internal helper function that provides the hashing
|
``_Py_HashBytes`` is an internal helper function that provides the hashing
|
||||||
code for bytes, memoryview and datetime classes. It currently implements FNV
|
code for bytes, memoryview and datetime classes. It currently implements FNV
|
||||||
for ``unsigned char*``. The function can either be modified to use the new
|
for ``unsigned char *``.
|
||||||
API or it could be completely removed to avoid an unnecessary level of
|
|
||||||
indirection.
|
|
||||||
|
|
||||||
|
The function is moved to Python/pyhash.c and modified to use the hash function
|
||||||
|
through PyHash_Func.hash(). The function signature is altered to take
|
||||||
|
a ``const void *`` as first argument. ``_Py_HashBytes`` also takes care of
|
||||||
|
special cases. It maps zero length input to ``0`` and a return value of ``-1``
|
||||||
|
to ``-2``.
|
||||||
|
|
||||||
bytes_hash (Objects/bytesobject.c)
|
bytes_hash() (Objects/bytesobject.c)
|
||||||
----------------------------------
|
------------------------------------
|
||||||
|
|
||||||
``bytes_hash`` uses ``_Py_HashBytes`` to provide the tp_hash slot function
|
``bytes_hash`` uses ``_Py_HashBytes`` to provide the tp_hash slot function
|
||||||
for bytes objects. If ``_Py_HashBytes`` is to be removed then ``bytes_hash``
|
for bytes objects. The function will continue to use ``_Py_HashBytes``
|
||||||
must be reimplemented.
|
but without a type cast.
|
||||||
|
|
||||||
|
memory_hash() (Objects/memoryobject.c)
|
||||||
memory_hash (Objects/memoryobject.c)
|
--------------------------------------
|
||||||
------------------------------------
|
|
||||||
|
|
||||||
``memory_hash`` provides the tp_hash slot function for read-only memory
|
``memory_hash`` provides the tp_hash slot function for read-only memory
|
||||||
views if the original object is hashable, too. It's the only function that
|
views if the original object is hashable, too. It's the only function that
|
||||||
has to support hashing of unaligned memory segments in the future.
|
has to support hashing of unaligned memory segments in the future. The
|
||||||
|
function will continue to use ``_Py_HashBytes`` but without a type cast.
|
||||||
|
|
||||||
|
|
||||||
unicode_hash (Objects/unicodeobject.c)
|
unicode_hash() (Objects/unicodeobject.c)
|
||||||
--------------------------------------
|
----------------------------------------
|
||||||
|
|
||||||
``unicode_hash`` provides the tp_hash slot function for unicode. Right now it
|
``unicode_hash`` provides the tp_hash slot function for unicode. Right now it
|
||||||
implements the FNV algorithm three times for ``unsigned char*``, ``Py_UCS2``
|
implements the FNV algorithm three times for ``unsigned char*``, ``Py_UCS2``
|
||||||
|
@@ -416,12 +428,12 @@ multiplied with the size of the internal unicode kind::
|
||||||
|
|
||||||
if (PyUnicode_READY(u) == -1)
|
if (PyUnicode_READY(u) == -1)
|
||||||
return -1;
|
return -1;
|
||||||
x = PyHash_Func.hash(PyUnicode_DATA(u),
|
x = _Py_HashBytes(PyUnicode_DATA(u),
|
||||||
PyUnicode_GET_LENGTH(u) * PyUnicode_KIND(u));
|
PyUnicode_GET_LENGTH(u) * PyUnicode_KIND(u));
|
||||||
|
|
||||||
|
|
||||||
generic_hash (Modules/_datetimemodule.c)
|
generic_hash() (Modules/_datetimemodule.c)
|
||||||
----------------------------------------
|
------------------------------------------
|
||||||
|
|
||||||
``generic_hash`` acts as a wrapper around ``_Py_HashBytes`` for the tp_hash
|
``generic_hash`` acts as a wrapper around ``_Py_HashBytes`` for the tp_hash
|
||||||
slots of date, time and datetime types. timedelta objects are hashed by their
|
slots of date, time and datetime types. timedelta objects are hashed by their
|
||||||
|
@@ -459,8 +471,17 @@ it is fed with medium size byte strings as well as ASCII and UCS2 Unicode
|
||||||
strings. For very short strings the setup cost for SipHash dominates its
|
strings. For very short strings the setup cost for SipHash dominates its
|
||||||
speed but it is still in the same order of magnitude as the current FNV code.
|
speed but it is still in the same order of magnitude as the current FNV code.
|
||||||
|
|
||||||
It's yet unknown how the new distribution of hash values affects collisions
|
|
||||||
of common keys in dicts of Python classes.
|
Hash value distribution
|
||||||
|
-----------------------
|
||||||
|
|
||||||
|
A good distribution of hash values is important for dict and set performance.
|
||||||
|
Both SipHash24 and FNV take the length of the input into account, so that
|
||||||
|
strings made up entirely of NUL bytes don't have the same hash value. The
|
||||||
|
last bytes of the input tend to affect the least significant bits of the hash
|
||||||
|
value, too. That attribute reduces the number of hash collisions for strings
|
||||||
|
with a common prefix.
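
The effect of mixing in the length can be illustrated with a simplified FNV-style loop. The sketch below assumes a zero prefix and suffix (no randomization) and uses the hypothetical name ``sketch_fnv``. Without the final ``x ^= len`` step, every all-NUL input would hash to the same value.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified FNV-style hash with the random prefix and suffix
   assumed to be zero; for illustration only. */
static size_t
sketch_fnv(const unsigned char *p, size_t len)
{
    size_t x, i;

    assert(p != NULL || len == 0);
    if (len == 0)
        return 0;
    x = (size_t)p[0] << 7;
    for (i = 0; i < len; i++)
        x = (1000003u * x) ^ p[i];
    x ^= len;   /* the input length influences the outcome */
    return x;
}
```

For a buffer of NUL bytes the multiply and XOR loop leaves ``x`` at ``0``, so the result equals the length: one NUL byte hashes to ``1`` and two NUL bytes hash to ``2``.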
|
||||||
|
|
||||||
|
|
||||||
Typical length
|
Typical length
|
||||||
--------------
|
--------------
|
||||||
|
|