murmur hash v3 in MySQL utf8 compatible
Murmur hash v3 in MySQL utf8 compatible
These days for performance purpose, I needed to replace some autoincrement
Id
by a hash calculated one, in order to have a determinist way to retrieve from
an object, it’s record in database, and easily recognize duplicate.
For that I have chosen the murmur hash v3 function
For new records, I could use the python module (leveraging C code) mmh3
But as I needed to update million of existing record, I couldn’t possibly imagine doing that by calling a python script.
First solution: user defined function copy/pasted from a gist
Instead I found that somebody already created a user defined function for that and posted it on a gist here
With that I could now do
UPDATE my_table
SET hashed_field = murmur_hash_v3(
CONCAT(other_field, another_field),
0
);
and update in a minute my records which would have normally taken a good hour with a python script iterating on the records
Problem: it does not handle utf-8 strings
But we soon run onto a problem with utf-8 records, the hash were different between python and mysql version of the hash
Solutions, plitting by bytes instead of character
The problem is that ascii()
methods in MySQL is not made to handle non-ascii characters, and that the methods is iterating on each character
so to solved the problem, we first casted the string into a binary
type.
SET keyx = CAST(key_char AS BINARY);
That way, the split was now done on bytes instead of characters
The function now:
Now you can copy paste this version :)
DELIMMITER //
CREATE FUNCTION `murmur_hash_v3`(`key_char` TEXT, `seed` int unsigned)
RETURNS int unsigned
DETERMINISTIC
BEGIN
DECLARE keyx BLOB;
DECLARE remainder,bytes,c1,c2,i, m1,m2 INT unsigned;
DECLARE h1,k1,h1b BIGINT unsigned;
SET keyx = CAST(key_char AS BINARY);
SET remainder = LENGTH(keyx) & 3;
SET bytes = LENGTH(keyx) - remainder;
SET h1 = seed;
SET c1 = 0xcc9e2d51;
SET c2 = 0x1b873593;
SET m1 = 0x85ebca6b, m2 = 0xc2b2ae35;
SET i = 1;
WHILE i <= bytes DO
SET k1 =
(ascii(mid(keyx,i , 1)) & 0xff) |
((ascii(mid(keyx,i+1,1)) & 0xff) << 8) |
((ascii(mid(keyx,i+2,1)) & 0xff) << 16) |
((ascii(mid(keyx,i+3,1)) & 0xff) << 24)
;
SET i = i + 4;
SET k1 = (k1*c1) & 0xffffffff;
SET k1 = ((k1 << 15) | (k1 >> 17))& 0xffffffff;
SET k1 = (k1*c2) & 0xffffffff;
SET h1 = h1 ^ k1;
SET h1 = ((h1 << 13) | (h1 >> 19))& 0xffffffff;
SET h1b = (h1*5) & 0xffffffff;
SET h1 = (h1b+0xe6546b64)& 0xffffffff;
END WHILE;
SET k1 = 0;
IF remainder>=3 THEN SET k1 = k1^((ascii(mid(keyx,i + 2,1)) & 0xff) << 16); END IF;
IF remainder>=2 THEN SET k1 = k1^((ascii(mid(keyx,i + 1,1)) & 0xff) << 8); END IF;
IF remainder>=1 THEN SET k1 = k1^((ascii(mid(keyx,i + 0,1)) & 0xff) << 0);
SET k1 = (k1*c1) & 0xffffffff;
SET k1 = ((k1 << 15) | (k1 >> 17))& 0xffffffff;
SET k1 = (k1*c2) & 0xffffffff;
SET h1 = h1 ^ k1;
END IF;
SET h1 = h1 ^ LENGTH(keyx);
SET h1 = h1 ^ (h1 >> 16);
SET h1 = (h1*m1) & 0xffffffff;
SET h1 = h1 ^ (h1 >> 13);
SET h1 = (h1*m2) & 0xffffffff;
SET h1 = h1 ^ (h1 >> 16);
RETURN h1;
END;
//
DELIMITER ;