murmur hash v3 in MySQL utf8 compatible

  • March 20, 2016

Murmur hash v3 in MySQL utf8 compatible

These days for performance purpose, I needed to replace some autoincrement Id by a hash calculated one, in order to have a determinist way to retrieve from an object, it’s record in database, and easily recognize duplicate.

For that I have chosen the murmur hash v3 function

For new records, I could use the python module (leveraging C code) mmh3

But as I needed to update million of existing record, I couldn’t possibly imagine doing that by calling a python script.

First solution: user defined function copy/pasted from a gist

Instead I found that somebody already created a user defined function for that and posted it on a gist here

With that I could now do

UPDATE my_table
SET hashed_field = murmur_hash_v3(
    CONCAT(other_field, another_field),
    0
);

and update in a minute my records which would have normally taken a good hour with a python script iterating on the records

Problem: it does not handle utf-8 strings

But we soon run onto a problem with utf-8 records, the hash were different between python and mysql version of the hash

Solutions, plitting by bytes instead of character

The problem is that ascii() methods in MySQL is not made to handle non-ascii characters, and that the methods is iterating on each character

so to solved the problem, we first casted the string into a binary type.

SET keyx = CAST(key_char AS BINARY);

That way, the split was now done on bytes instead of characters

The function now:

Now you can copy paste this version :)

        DELIMMITER //

       CREATE FUNCTION `murmur_hash_v3`(`key_char` TEXT, `seed` int unsigned)
            RETURNS int unsigned
            DETERMINISTIC
        BEGIN
            DECLARE keyx BLOB;
            DECLARE remainder,bytes,c1,c2,i, m1,m2 INT unsigned;
            DECLARE h1,k1,h1b BIGINT unsigned;
            SET keyx = CAST(key_char AS BINARY);
            SET remainder = LENGTH(keyx) & 3;
            SET bytes = LENGTH(keyx) - remainder;
            SET h1 = seed;
            SET c1 = 0xcc9e2d51;
            SET c2 = 0x1b873593;
            SET m1 = 0x85ebca6b, m2 = 0xc2b2ae35;
            SET i = 1;

            WHILE i <= bytes DO
                SET k1 =
                     (ascii(mid(keyx,i , 1)) & 0xff)        |
                    ((ascii(mid(keyx,i+1,1)) & 0xff) << 8)  |
                    ((ascii(mid(keyx,i+2,1)) & 0xff) << 16) |
                    ((ascii(mid(keyx,i+3,1)) & 0xff) << 24)
                  ;
                SET i = i + 4;

                SET k1 = (k1*c1) & 0xffffffff;
                SET k1 = ((k1 << 15) | (k1 >> 17))& 0xffffffff;
                SET k1 = (k1*c2) & 0xffffffff;

                SET h1 = h1 ^ k1;
                SET h1 = ((h1 << 13) | (h1 >> 19))& 0xffffffff;
                SET h1b = (h1*5) & 0xffffffff;
                SET h1 = (h1b+0xe6546b64)& 0xffffffff;
            END WHILE;

            SET k1 = 0;

            IF remainder>=3 THEN SET k1 = k1^((ascii(mid(keyx,i + 2,1)) & 0xff) << 16); END IF;
            IF remainder>=2 THEN SET k1 = k1^((ascii(mid(keyx,i + 1,1)) & 0xff) <<  8); END IF;
            IF remainder>=1 THEN SET k1 = k1^((ascii(mid(keyx,i + 0,1)) & 0xff) <<  0);
                SET k1 = (k1*c1) & 0xffffffff;
                SET k1 = ((k1 << 15) | (k1 >> 17))& 0xffffffff;
                SET k1 = (k1*c2) & 0xffffffff;
                SET h1 = h1 ^ k1;
            END IF;

            SET h1 = h1 ^ LENGTH(keyx);
            SET h1 = h1 ^ (h1 >> 16);
            SET h1 = (h1*m1) & 0xffffffff;

            SET h1 = h1 ^ (h1 >> 13);
            SET h1 = (h1*m2) & 0xffffffff;
            SET h1 = h1 ^ (h1 >> 16);

            RETURN h1;
        END;
    //
    DELIMITER ;