Abstract

I offer you a new hash function for hash table lookup that is faster and more thorough than the one you are using now. I also give you a way to verify that it is more thorough.

The Hash

Over the past two years I've built a general hash function for hash table lookup. Most of the two dozen old hashes I've replaced have had owners who wouldn't accept a new hash unless it was a plug-in replacement for their old hash, and was demonstrably better than the old hash. These old hashes defined my requirements:

* The keys are unaligned variable-length byte arrays.

Without further ado, here's the fastest hash I've been able to design that meets all the requirements. For every delta with one or two bits set, and for the deltas of all three high bits or all three low bits, whether the original value of a,b,c is almost all zero or is uniformly distributed:

* If mix() is run forward or backward, at least 32 bits in a,b,c have at least a 1/4 probability of changing.

mix() was written so that some of its operations can proceed in parallel. Unfortunately, superscalar Pentiums and Sparcs can't take advantage of that parallelism. They've also turned some of those single-cycle latency instructions into multi-cycle latency instructions.

Every bit of the key affects every bit of the return value. If you need fewer than 32 bits, use a bitmask. For example, if you need only 10 bits, do

  h = (h & hashmask(10));

in which case the hash table should have hashsize(10) elements. If you are hashing n strings (ub1 **)k, do it like this:

  for (i=0, h=0; i<n; ++i) h = hash(k[i], len[i], h);
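For concreteness, here is a self-contained sketch of the function and the helper macros used above. It matches the design described in the next section (three 4-byte registers, 12-byte blocks, a final switch), with constants and shift amounts taken from the lookup2.c formulation of this hash; note that the original ub4 typedef assumed a 4-byte long, so a modern port would use a fixed 32-bit type instead.

typedef unsigned long int ub4;   /* unsigned 4-byte quantities (use a
                                    32-bit type on 64-bit platforms) */
typedef unsigned char     ub1;   /* unsigned 1-byte quantities */

#define hashsize(n) ((ub4)1<<(n))
#define hashmask(n) (hashsize(n)-1)

/* mix three 32-bit values reversibly */
#define mix(a,b,c) \
{ \
  a -= b; a -= c; a ^= (c>>13); \
  b -= c; b -= a; b ^= (a<<8);  \
  c -= a; c -= b; c ^= (b>>13); \
  a -= b; a -= c; a ^= (c>>12); \
  b -= c; b -= a; b ^= (a<<16); \
  c -= a; c -= b; c ^= (b>>5);  \
  a -= b; a -= c; a ^= (c>>3);  \
  b -= c; b -= a; b ^= (a<<10); \
  c -= a; c -= b; c ^= (b>>15); \
}

/* hash a variable-length key into a 32-bit value;
   initval is the previous hash, or an arbitrary value */
ub4 hash(ub1 *k, ub4 length, ub4 initval)
{
  ub4 a, b, c, len;

  len = length;
  a = b = 0x9e3779b9;   /* the golden ratio; an arbitrary value */
  c = initval;

  /* handle most of the key, 12 bytes at a time */
  while (len >= 12)
  {
    a += (k[0] + ((ub4)k[1]<<8) + ((ub4)k[2]<<16)  + ((ub4)k[3]<<24));
    b += (k[4] + ((ub4)k[5]<<8) + ((ub4)k[6]<<16)  + ((ub4)k[7]<<24));
    c += (k[8] + ((ub4)k[9]<<8) + ((ub4)k[10]<<16) + ((ub4)k[11]<<24));
    mix(a,b,c);
    k += 12; len -= 12;
  }

  /* handle the last 11 bytes; all case statements fall through */
  c += length;
  switch (len)
  {
  case 11: c += ((ub4)k[10]<<24);
  case 10: c += ((ub4)k[9]<<16);
  case 9:  c += ((ub4)k[8]<<8);
           /* the low byte of c is reserved for the length */
  case 8:  b += ((ub4)k[7]<<24);
  case 7:  b += ((ub4)k[6]<<16);
  case 6:  b += ((ub4)k[5]<<8);
  case 5:  b += k[4];
  case 4:  a += ((ub4)k[3]<<24);
  case 3:  a += ((ub4)k[2]<<16);
  case 2:  a += ((ub4)k[1]<<8);
  case 1:  a += k[0];
  /* case 0: nothing left to add */
  }
  mix(a,b,c);
  return c;
}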
Blocks of text are combined with the internal state (a,b,c) by addition. This combining step is the rest of the hash function, consuming the remaining 3n instructions. The only postprocessing is to choose c out of (a,b,c) to be the result.

Mixing is done on three 4-byte registers rather than on a 1-byte quantity. Combining is done on 12-byte blocks, reducing the loop overhead. The final switch statement combines a variable-length block with the registers a,b,c without a loop.

The Hash Must Do a Good Job

The most interesting requirement was that the hash must be better than its competition. What does it mean for a hash to be good for hash table lookup? If you don't know the keys before choosing the function, the best you can do is map an equal number of possible keys to each hash value. If keys were distributed uniformly, an excellent hash would be to choose the first few bytes of the key and use that as the hash value. In practice, though, choosing the first few bytes works quite poorly. The real requirement, then, is that a good hash function should distribute hash values uniformly for the keys that users actually use. Consider a sample of real keys:

EMPNO ENAME  JOB       MGR  HIREDATE    SAL COMM DEPTNO
7369  SMITH  CLERK     7902 17-DEC-80   800         20
7499  ALLEN  SALESMAN  7698 20-FEB-81  1600  300    30
7521  WARD   SALESMAN  7698 22-FEB-81  1250  500    30
7566  JONES  MANAGER   7839 02-APR-81  2975         20
7654  MARTIN SALESMAN  7898 28-SEP-81  1250 1400    30
7698  BLAKE  MANAGER   7539 01-MAY-81  2850         30
7782  CLARK  MANAGER   7566 09-JUN-81  2450         10
7788  SCOTT  ANALYST   7698 19-APR-87  3000         20
7839  KING   PRESIDENT      17-NOV-81  5000         10
7844  TURNER SALESMAN  7698 08-SEP-81  1500         30
7876  ADAMS  CLERK     7788 23-MAY-87  1100    0    20
7900  JAMES  CLERK     7698 03-DEC-81   950         30
7902  FORD   ANALYST   7566 03-DEC-81  3000         20
7934  MILLER CLERK     7782 23-JAN-82  1300         10

Consider each horizontal row to be a key. These keys illustrate some common patterns. All the keys are ASCII, so the high bit of every byte is zero. Keys often consist of substrings arranged in different orders; for example, the MGR of some keys is the EMPNO of others. The only difference between zero and no value at all may be the length of the value. Also, "aa aaa" and "aaa aa" should hash to different values.

If the length is included in the data being hashed, then lengths are not a problem. If the hash does not treat text blocks commutatively, then substrings are not a problem. Strings that are mostly zeros can be tested for by listing all strings with only one bit set and checking whether that set of strings produces too many collisions; a test of this sort is sketched below.

The remaining pattern is that keys often differ in only a few bits. If a hash allows small sets of input bits to cancel each other out, and the user keys differ in only those bits, then all keys will map to the same handful of hash values.
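One way to run the mostly-zeros test just mentioned: enumerate every key of a given length with exactly one bit set, hash each one, and count how many hash values coincide. The harness below is a hypothetical sketch; one_bit_test is an illustrative name, and hash() is the function from the sketch above. The article does not prescribe a particular implementation.

#include <stdio.h>
#include <stdlib.h>

/* hash every len-byte key with exactly one bit set; with only len*8
   such keys, any colliding pairs here would be suspicious */
void one_bit_test(ub4 len)
{
  ub4 n = len*8, i, j, dups = 0;
  ub4 *h = (ub4 *)malloc(n*sizeof(ub4));
  ub1 *key = (ub1 *)calloc(len, 1);     /* the all-zero key */

  for (i = 0; i < n; ++i) {
    key[i/8] = (ub1)(1 << (i%8));       /* set bit i, all others zero */
    h[i] = hash(key, len, 0);
    key[i/8] = 0;                       /* clear it for the next key */
  }
  for (i = 0; i < n; ++i)               /* n is small; O(n^2) is fine */
    for (j = i+1; j < n; ++j)
      if (h[i] == h[j]) ++dups;
  printf("%lu one-bit keys of length %lu: %lu colliding pairs\n",
         (unsigned long)n, (unsigned long)len, (unsigned long)dups);
  free(h); free(key);
}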
A common weakness

Usually, when a small set of input bits cancel each other out, it is because those input bits affect only a smaller set of bits in the internal state. Consider two keys that differ in bit 3 of the first byte and bit 1 of the seventh byte. After the seventh byte is combined, any further postprocessing will do no good, because the internal states are already the same. Any time n input bits can only affect m output bits, and n > m, then the 2^n keys that differ in those input bits can produce only 2^m distinct hash values. The same is true if n input bits can only affect m bits of the internal state -- later mixing may make the 2^m results look uniformly distributed, but there will still be only 2^m results.

A function with this weakness has many sets of 2 bits that affect only 1 bit of the internal state. If there are n input bits, there are (n choose 2) = (n*n - n)/2 pairs of input bits, only a few of which match such weaknesses. It is a common pattern for keys to differ in only a few bits. If those bits match one of a hash's weaknesses, which is a rare but not negligible event, the hash will do extremely badly.

Here is one way the weakness can arise in a hash that alternates combining and mixing steps:

1. A small number of bits y of one input block are combined, affecting only y bits of the internal state.
2. The mixing step causes those y bits of the internal state to affect only z bits of the internal state.
3. The next combining step overwrites those bits with z more input bits, cancelling out the first y input bits.

When z is smaller than the number of bits in the output, then y+z input bits have affected only z bits of the internal state, causing 2^(y+z) possible keys to produce at most 2^z hash values.

The same thing can happen running the hash backward from the final state:

1. Uncombine a block, causing y block bits to unaffect y bits of the internal state.
2. Unmix the internal state, leaving x bits unaffected by the y bits from this block.
3. Unmixing the previous block unaffects those x bits, cancelling out this block's y bits.

If x is less than the number of bits in the output, then the 2^(x+y) keys differing in only those x+y input bits can produce at most 2^x hash values.

A mixing function that is not reversible cannot preserve everything the earlier blocks contributed. Instead, it loses information about the earlier blocks every time it is applied, so keys differing only in the first few input blocks are more likely to collide. Any test for this weakness should be run on the reverse of the mixing function as well. It can also be run with all sets of 2 internal state bits, or all sets of 3. Another way this weakness can happen is if any bit in the final input block does not affect every bit of the output.

Additive Hash

ub4 additive(char *key, ub4 len, ub4 prime)
{
  ub4 hash, i;
  for (hash = len, i = 0; i < len; ++i) hash += key[i];
  return (hash % prime);
}
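The additive hash shows the 2-bits-affect-1-bit weakness concretely: flipping the same bit position in two different bytes changes the sum by amounts that cancel, so the two keys collide for every choice of prime. The keys below are hypothetical, chosen only to exhibit the collision.

#include <stdio.h>

/* k1 and k2 differ in bit 0 of byte 0 and bit 0 of byte 1;
   both sum to len + 1, so additive() cannot tell them apart */
int main(void)
{
  char k1[2] = {0x01, 0x00};
  char k2[2] = {0x00, 0x01};
  ub4 prime = 1009;            /* an arbitrary prime table size */
  printf("%lu %lu\n",
         (unsigned long)additive(k1, 2, prime),
         (unsigned long)additive(k2, 2, prime));  /* prints equal values */
  return 0;
}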
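Another hash in the comparison is the Zobrist hash, which replaces arithmetic entirely with lookups into a table tab of random values; the discussion below refers to that table. A minimal sketch, assuming the caller fills tab with random ub4 values before use (MAXBYTES and the parameter names here are illustrative):

/* Zobrist hash: XOR together one random table entry per input byte */
#define MAXBYTES 256   /* assumed maximum key length for this sketch */

ub4 zobrist(ub1 *key, ub4 len, ub4 mask, ub4 tab[MAXBYTES][256])
{
  ub4 hash, i;
  for (hash = len, i = 0; i < len; ++i)
    hash ^= tab[i][key[i]];
  return (hash & mask);
}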
The first dimension of tab[][256] is the maximum number of input bytes; every byte position gets its own array of 256 random values. Universal hashing can be implemented by a Zobrist hash with carefully chosen table values.

MD4 takes 420 instructions to hash a block of 64 aligned bytes. I combined that with my hash's method of putting unaligned bytes into registers, adding 3n instructions.

The hashes were compared on several measures:

SIZE-1000 is the smallest reasonable hash table size greater than 1000.

SPEED is the speed of the hash, measured in instructions required to produce a hash value for a table with SIZE-1000 buckets.

INLINE is the speed assuming the hash is inlined in a loop that has to walk through all the characters anyway, such as a tokenizer. Such a loop doesn't always exist, and even when it does, inlining isn't always possible. Some hashes (my new hash and MD4) work on blocks larger than a character. Inlining a hash removes 3n+1 instructions of loop overhead. It also removes the n instructions needed to get the characters out of the key array and into a register. It allows the string to be converted to uppercase, and/or to Unicode, before the hash is performed without the expense of an extra loop or a temporary buffer.

FUNNEL-15 is the largest set of input bits affecting the smallest set of internal state bits when mapping 15-byte keys into a 1-byte result.

FUNNEL-100 is the largest set of input bits affecting the smallest set of internal state bits when mapping 100-byte keys into a 32-bit result.

COLLIDE-32 is the number of collisions found when a dictionary of 38,470 ...
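A collision count of the COLLIDE-32 sort can be measured by hashing every word in a dictionary into a 32-bit value and counting duplicates, for example by sorting. A hypothetical harness (count_collisions is an illustrative name; loading the word list and producing the array of hash values is assumed to happen elsewhere):

#include <stdlib.h>

/* compare two ub4 values for qsort */
static int cmp_ub4(const void *x, const void *y)
{
  ub4 a = *(const ub4 *)x, b = *(const ub4 *)y;
  return (a > b) - (a < b);
}

/* count words whose 32-bit hash duplicates an earlier word's hash;
   sorts h in place, then counts adjacent equal values */
ub4 count_collisions(ub4 *h, size_t n)
{
  ub4 collisions = 0;
  size_t i;
  qsort(h, n, sizeof(ub4), cmp_ub4);
  for (i = 1; i < n; ++i)
    if (h[i] == h[i-1]) ++collisions;
  return collisions;
}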