SimonHF's Blog

Just another WordPress.com site

Murmur Hash: very fast and collision resistant? November 1, 2008

Filed under: Uncategorized — simonhf @ 9:30 pm

Being interested in everything related to hashes then I was very pleased to recently come across MurmurHash 2.0 which is said to be faster and more collision resistant than all other publically available 32bit hashing algorithms.

The author, Austin Appleby, also makes an impressive claim:

“Murmur is close enough to random to be suitable for virtually all
keysets, though it does have the interesting property that it produces
0 collisions if the keyset is all 4-byte integers from 0 to 2^32-1 –
this is a side effect of it doing mixes on the full 4-byte hash state
using only reversible operations.”

This claim is so impressive that I decided to test it. Here is the source code for murmurhash2.c:

// Code taken from http://murmurhash.googlepages.com/MurmurHash2.cpp

//-----------------------------------------------------------------------------
// MurmurHash2, by Austin Appleby

// Note - This code makes a few assumptions about how your machine behaves -

// 1. We can read a 4-byte value from any address without crashing
// 2. sizeof(int) == 4

// And it has a few limitations -

// 1. It will not work incrementally.
// 2. It will not produce the same results on little-endian and big-endian
//    machines.

unsigned int MurmurHash2 ( const void * key, int len, unsigned int seed )
{
        // 'm' and 'r' are mixing constants generated offline.
        // They're not really 'magic', they just happen to work well.

        const unsigned int m = 0x5bd1e995;
        const int r = 24;

        // Initialize the hash to a 'random' value

        unsigned int h = seed ^ len;

        // Mix 4 bytes at a time into the hash

        const unsigned char * data = (const unsigned char *)key;

        while(len >= 4)
        {
                unsigned int k = *(unsigned int *)data;

                k *= m;
                k ^= k >> r;
                k *= m;

                h *= m;
                h ^= k;

                data += 4;
                len -= 4;
        }

        // Handle the last few bytes of the input array

        switch(len)
        {
        case 3: h ^= data[2] << 16;
        case 2: h ^= data[1] << 8;
        case 1: h ^= data[0];
                h *= m;
        };

        // Do a few final mixes of the hash to ensure the last few
        // bytes are well-incorporated.

        h ^= h >> 13;
        h *= m;
        h ^= h >> 15;

        return h;
}

And here is the source code for the test program test-collisions-for-2pow32-032bit-keys.c:

#include <stdio.h>
#include <malloc.h>
#include "murmurhash2.c"

/**
 * Copyright (c) 2008 Simon Hardy-Francis
 *
 * Permission is hereby granted, free of charge, to any person
 * obtaining a copy of this software and associated documentation
 * files (the "Software"), to deal in the Software without
 * restriction, including without limitation the rights to use,
 * copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the
 * Software is furnished to do so, subject to the following
 * conditions:
 *
 * The above copyright notice and this permission notice shall be
 * included in all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
 * OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
 * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
 * HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
 * WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
 * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
 * OTHER DEALINGS IN THE SOFTWARE.
 *
 * Except as contained in this notice, the name(s) of the above
 * copyright holders shall not be used in advertising or otherwise
 * to promote the sale, use or other dealings in this Software
 * without prior written authorization.
 */

#define TWO_POW_32 4294967296
#define TWO_POW_32_DIV_8 536870912
#define BYTE_STATS_MAX 4096

unsigned char bit_masks[8] = {1,2,4,8,16,32,64,128};
unsigned int byte_stats[BYTE_STATS_MAX];

int main(void)
{
    union {
       char as_u08[24];
       char as_s08[24];
       unsigned int as_u32[6];
    } key;
    unsigned int hash32;
    unsigned int i,j,k,l,m,n;
    unsigned int byte_index;
    unsigned int bit_index;
    unsigned int bit_mask;
    unsigned int collisions_so_far;
    double dcollisions_so_far;
    double di;
    char * bits = malloc(TWO_POW_32_DIV_8);

    collisions_so_far = 0;

    for ( i = 0 ; i < BYTE_STATS_MAX ; ++ i )
    {
        byte_stats[i] = 0;
    }

    for ( j = 0 ; j < (TWO_POW_32_DIV_8) ; ++ j )
    {
        bits[j] = 0;
    }

    for ( i = 0 ; i < (TWO_POW_32-1) ; ++ i )
    //for ( i = 0 ; i < (500000001) ; ++ i )
    {
       key.as_u32[0] = i;
       key.as_u32[1] = i;
       key.as_u32[2] = i;
       key.as_u32[3] = i;
       hash32 = MurmurHash2 ( &key.as_s08[0], 4, (0xdeadbeef * 16) );
       if ( hash32 < BYTE_STATS_MAX )
       {
           byte_stats[hash32] ++;
       }
       byte_index = hash32;
       bit_index  = hash32;
       byte_index >>= 3;   
       bit_index &= 7;
       bit_mask = bit_masks[bit_index];
       if(((i % 10000000) == 0)
       && (i != 0))
       {
          di = i;
          dcollisions_so_far = collisions_so_far;
          printf("%10u collisions_so_far or %4.1f%% of %10u ", collisions_so_far, (dcollisions_so_far/di*100), i);
          /* draw graph 😉 */
          k = (dcollisions_so_far/di*100);
          for ( j = 0 ; j < (k/2) ; ++ j )
          {
             printf(" ");
          }
          printf("*\n");
       }
       if((bits[byte_index] & bit_mask ) != 0)
       {
          collisions_so_far ++;
       } else {
          bits[byte_index] |= bit_mask;
       }
    }

    for ( i = 0 ; i < BYTE_STATS_MAX ; ++ i )
    {
        if ( byte_stats[i] == 0 )
        {
                printf("-");
        }
        else
        {
            if ( byte_stats[i] >= 10 )
            {
                printf("(%d)", byte_stats[i]);
            }
            else
            {
                printf("%d", byte_stats[i]);
            }
        }
    }
}

And here’s the output:

# gcc test-collisions-for-2pow32-032bit-keys.c -o test-collisions-for-2pow32-032bit-keys && ./test-collisions-for-2pow32-032bit-keys
         0 collisions_so_far or  0.0% of   10000000 *
         0 collisions_so_far or  0.0% of   20000000 *
         0 collisions_so_far or  0.0% of   30000000 *
         0 collisions_so_far or  0.0% of   40000000 *
         0 collisions_so_far or  0.0% of   50000000 *
...
         0 collisions_so_far or  0.0% of 4250000000 *
         0 collisions_so_far or  0.0% of 4260000000 *
         0 collisions_so_far or  0.0% of 4270000000 *
         0 collisions_so_far or  0.0% of 4280000000 *
         0 collisions_so_far or  0.0% of 4290000000 *
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111

So the test program hashes 2^32 unique 32 bit input strings while detecting collisions and outputs results every 10 million iterations. We also capture a sample of 32bit hash value frequencies in the form of the first 4k of hash values and display these frequencies when the 2^32 loop is finished. If a hash value frequency is 0 (because it was never returned) then it is displayed as a dash ‘-‘, otherwise the frequency of the hash value is output. As expected, the output above shows 0 collisions during the 2^32 iterations and all 1s for the hash value frequencies.

Very nice I thought. So the cool thing about this hash algorithm is that it works on 32 bits of the input string to hash at a time. This is also one of the factors which makes it so fast. Most hashing algorithms only work on 8 bits of the input string at a time. A minority work on 16 bits of the input string at a time. Based on these initially very positive results I modified by test program to use bigger input strings. However, because I’m a very lazy developer I just duplicated the first 32 bits of the input string several times. Here is the diff showing how this new test program differs from the one above:

# diff test-collisions-for-2pow32-032bit-keys.c test-collisions-for-2pow32-128bit-keys.c
78c78
<        hash32 = MurmurHash2 ( &key.as_s08[0], 4, (0xdeadbeef * 16) );
---
>        hash32 = MurmurHash2 ( &key.as_s08[0], 16, (0xdeadbeef * 16) );

So now our input string to hash is 128 bits instead of 32 bits. The initial 32 bits of the input string to hash has been duplicated 4 times. And here is the output of the second test program:

# gcc test-collisions-for-2pow32-128bit-keys.c -o test-collisions-for-2pow32-128bit-keys && ./test-collisions-for-2pow32-128bit-keys
    746316 collisions_so_far or  7.5% of   10000000    *
   2758812 collisions_so_far or 13.8% of   20000000       *
   5913827 collisions_so_far or 19.7% of   30000000          *
   9966267 collisions_so_far or 24.9% of   40000000             *
  14697559 collisions_so_far or 29.4% of   50000000               *
...
4148620227 collisions_so_far or 97.6% of 4250000000                                                 *
4158620227 collisions_so_far or 97.6% of 4260000000                                                 *
4168620227 collisions_so_far or 97.6% of 4270000000                                                 *
4178620227 collisions_so_far or 97.6% of 4280000000                                                 *
4188620227 collisions_so_far or 97.6% of 4290000000                                                 *
--------------------(24)----------(40)----------------------------(20)-----------------------------------------------------------------------------------------------------------------------(88)--(32)---------4-----------------------------------------------------------------------------------------------8------8---------------------------------------(24)---------(52)-----------------------------(32)------------4---------------------------------------------------------------------------------------------------------(64)--------------------(180)--------------------------------------(36)----------------------4------(32)-----------------------------(12)------------------------------(26)--------(32)----------------------------------------8-----------------------------(16)----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------(32)-----------------------------(48)-----------------------------4--------------------------------(16)------8------------(36)----------------(148)--(64)-----------------------------8---------(68)-------------------8---------(40)-------8----------------------------------(52)-----(24)---------(96)---(44)--------------------------------------------(48)--------------------(32)---------(16)-------------------------------------------------8---------(16)---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------(16)--------------------------------------------------------------(40)-(12)--------------(112)----(16)-----------------(36)------------------------------------------------------------------------------------------------------------------------------------------------------------------------------(28)-------------(68)-----(16)----(12)--------(68)----------8----------------------------------(76)--------------------------------------------(56)-----------------(88)-(80)---------------------------------------8--------------------------------------------------------------------------------------------------------------------------------------------------(36)-------------------(36)-------------------------------------------------------------------------------(24)-----(32)-----------------------------------------------------4-------------------(12)----------------------------------------------------------------------------------------(40)-------------------(20)-------------------4-----------------(140)-------------------(56)---------8-----------------------------(16)-(124)--------(24)--------(32)-------------------------------------------------8-------------------(40)-----------------------------------------------------------------------------------------(68)---------------------------------------------------------------------------------------------------(32)------------------------------------------------------------------------(96)-----------------------------------------------------------(88)-------------------8--(24)-----------------------------2-----------------------------(32)---------(60)------4--(28)------4--------------------------------------------------------------------------------------------------------------------------------------(128)--------------------------(112)----------------------------------------(24)---------------------------------------8-----------------------------(88)-------------------8-------------------------------------------------(32)-------------------(28)---------(24)--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------(84)-------------------------------------------------(16)---------(76)-------------------(16)--------------------------------------------------------------------------------(12)-(40)-----------------(16)------------------------------(68)---------------------------------------------------------------(40)------------------------------

To my surprise the second program shows a 97.6% collision rate. The hash value frequency also reflects that many hash values in the first 4k range were never returned as a result. This result is soooo bad that I’m wondering if I made a mistake in the test programs somewhere. Maybe somebody else can re-write the short test and confirm the results?

Advertisements
 

2 Responses to “Murmur Hash: very fast and collision resistant?”

  1. Chris Johnson Says:

    This is a known flaw in murmur 2, see http://sites.google.com/site/murmurhash/murmurhash2flaw


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s