So there is no silver-bullet mechanism for testing the effectiveness of hash functions. Things like the avalanche effect are very important. Here we present the results of a brute force approach to testing the hash collision vulnerabilities of various hash functions. The hash functions to be tested are FNV (32bit), lookup3 (32bit), lookup3 (64bit), murmurhash2 (32bit), murmurhash2_160 (160bits), SHA1 (160bits), SuperFastHash (32bits), and SuperFastHash_64 (64bits). We test each hash function by generating 2^32 unique 24 byte strings, each of which gets hashed. The 2^32 24 byte strings are created by generating all possible 32bit unsigned integers and repeating six times; so the third unique 24 byte string will be 0×000000020000000200000002000000020000000200000002 and so on. We then record the value of each resulting 32bit hash in a 2^32 element array (read: buckets). At the end of the test we determine which percentage of array elements are empty. This percentage is also the percentage of hash collision. If a hash function returns multiple 32bit hashes then we test each of the resulting hash values as a separate test, e.g. a 64bit hash function would be tested twice; once for the lower 32bit hash result and once for the upper 32bit hash result. In addition, we test combinations of hash results in the same test by cycling through the hash results (read: round robin), e.g. a 64bit hash function would use the lower 32bit hash result for unique 24 byte string #1, and the upper 32bit hash result for unique 24 byte string #2, and the lower 32bit hash result for unique 24 byte string #3, and so on. Here are the results:

The best result is 36.8% hash collision for lookup3 (32bit; 4.4 million hashes/second), lookup3 (64bit; 4.4 million hashes/second), murmurhash2 (160bit; but not the round robin; 4.0 million hashes/second), and SHA1 (160bit; 0.9 million hashes/second). If anybody cares to comment, I’d be interested in understanding the math behind why it’s not possible to get a result with a lower hash collision rate than 36.8% for a large sample like this.

### Like this:

Like Loading...

*Related*

36.8% is exactly what you would expect – it is 1/e

You are putting N balls into N buckets (where N is large) and counting the number of collisions:

The probability of a particular ball going into a particular bucket is 1/N

so the probability of that ball not going into that bucket is 1-1/N

so the probably none of the balls go into that bucket is (1-1/N)^N

which for large N is approximately 1/e

So after all the balls have been thrown each bucket has a 1/e probablity of being empty – ie you expect there to be N/e empty buckets. Since number of balls = number of buckets, each empty bucket means a ball instead went into an already occupied bucket – ie number of empty buckets = number of collisions.

So on average there will be N/e collisions for N balls

ie the average probability of a collision is 1/e = 36.8%

For any “random” hash function with no collision anomalies this is the best expected case.