A while back we had an interesting problem at work. We were calculating MD5 hashes for some values in both .Net and SQL Server and although the input values were all the same our MD5 hashes were different. After some time spent looking dumbfounded at the code I’ve realized what the bug was and started laughing. I saw that HashBytes function was the “culprit”. HashBytes function was introduced in SQL server 2005 to simplify creating hashes in the database. It can convert values to MD2, MD4, MD5, SHA, or SHA1 formats. HashBytes' result depends on the input text’s data type and can at the first glance give a bit different results that we might think, especially when combined with the .Net framework. Of course hashing itself works correctly in both. Let’s see what going on with some examples. From here on when i say hash i mean MD5 hash.
Example in SQL Server
-- note the size 15 of all datatypes declare @val1 varchar(15), @val2 nvarchar(15), @val3 char(15), @val4 nchar(15) -- example of 1 byte/char text -- all variables are of same length select @val1 = '1234567890', @val2 = N'1234567890', @val3 = '1234567890', @val4 = N'1234567890' -- all 4 return different results select HASHBYTES('md5', @val1) as MD5_varchar, -- result = 0xE807F1FCF82D132F9BB018CA6738A19F -- just to show that collation doesn't change the hash HASHBYTES('md5', @val1 collate Cyrillic_General_BIN2) as MD5_varchar_collation, -- result = 0xE807F1FCF82D132F9BB018CA6738A19F HASHBYTES('md5', @val2) as MD5_Nvarchar, -- result = 0xE15E31C3D8898C92AB172A4311BE9E84 HASHBYTES('md5', @val3) as MD5_char, -- result = 0x2120C3F3423F89BA8A65ABD933321884 HASHBYTES('md5', @val4) as MD5_Nchar -- result = 0x90DEF5840F3A31174CA44E2022F743B6 -- example of 2 bytes/char text
-- 中文 means Chinese in written text -- converting 2 bytes/char text to varchar cuts the text in half select @val1 = '中文', -- this cuts the text in half @val2 = N'中文', -- this stores the whole text @val3 = '中文', -- this cuts the text in half @val4 = N'中文' -- this stores the whole text -- all 4 return different results select HASHBYTES('md5', @val1) as MD5_varchar, -- result = 0xEA03FCB8C47822BCE772CF6C07D0EBBB HASHBYTES('md5', @val2) as MD5_Nvarchar, -- result = 0x73C6C8CD2F94355EF015E5265D5E65B1 HASHBYTES('md5', @val3) as MD5_char, -- result = 0xA13C45A38853677887B4839071537634 HASHBYTES('md5', @val4) as MD5_Nchar -- result = 0xEADEBD3BD72A481C43C828E0C550145C
The catch here is the data type difference. Nvarchar and nchar take twice more bytes to store data than varchar and char. Since char and nchar pad the right side of the string with spaces to fill the gap up to defined data type length the spaces are also used in MD5 calculation. A collation plays no part in calculating hash values since collations are only applied to sorts and comparisons.
However care must be taken with texts that need 2 bytes per char of storage space like Chinese text. Storing it in a varchar variable cuts it in half.
Example in .Net - C#
For generating MD5 hashes I’ve used the code from this site which turned up as the first result on Google for “MD5 in C#”:
public string GetMD5Hash(string input) { System.Security.Cryptography.MD5CryptoServiceProvider x = new System.Security.Cryptography.MD5CryptoServiceProvider(); //byte[] bs = System.Text.Encoding.ASCII.GetBytes(input); //byte[] bs = System.Text.Encoding.UTF7.GetBytes(input); //byte[] bs = System.Text.Encoding.UTF8.GetBytes(input); byte[] bs = System.Text.Encoding.Unicode.GetBytes(input); //byte[] bs = System.Text.Encoding.UTF32.GetBytes(input); bs = x.ComputeHash(bs); System.Text.StringBuilder s = new System.Text.StringBuilder(); foreach (byte b in bs) { s.Append(b.ToString("x2").ToLower()); } string password = s.ToString(); return password; } // hashes for the simple “1234567890” text in all encodings ASCII: e807f1fcf82d132f9bb018ca6738a19f UTF7: e807f1fcf82d132f9bb018ca6738a19f UTF8: e807f1fcf82d132f9bb018ca6738a19f Unicode: e15e31c3d8898c92ab172a4311be9e84 UTF32: 6a57502c29a5081f03cb70e0ad38ecc7 // hashes for the complex “中文” text in all encodings ASCII: ea03fcb8c47822bce772cf6c07d0ebbb UTF7: eb02105e5c51a33f21e8da7f8102cfda UTF8: a7bac2239fcdcb3a067903d8077c4a07 Unicode: 73c6c8cd2f94355ef015e5265d5e65b1 UTF32: 65fe91b81ed1107566f9f9f5ed4ccaf1
All strings in .Net store their chars in 2 bytes by default. When hashing values we have to take this into account and use proper text Encoding. .Net supports 5 encodings: ASCII (7 bits per char), UTF7 (7 bits per char), UTF8 (8 bits = 1 byte per char), Unicode (UTF-16) (16 bits = 2 bytes per char) and UTF32 (32 bits = 4 bytes per char).
For text with only first 127 chars in the ASCII table ASCII, UTF7 and UTF8 encodings all return the same hash, but with UTF16 and UTF32 comesendianness so they don’t return the same hash values. Also note that the ASCII and UTF7 encoding aren’t recommended to be used anymore except in legacy apps. UTF8 should be used instead.
For text with complex chars we have to use Unicode or UTF32 or we loose char information.
Putting .Net and SQL Server together
It turns out that such a simple thing as hashing can become a serious issue if we’re not careful. Trouble always awaits when dealing with text and encodings. :)
When using SQL Server’s varchar data type the .Net encoding to go with is UTF8 since it’s the fastest and most optimized of the three (ASCII, UTF7, UTF8). When using the nvarchar data type to go is Unicode (UTF16) but we also have to know the texts endianness to create correct hashes. UTF32 is practically useless in this case because SQL Server doesn’t have a data type that stores text in 4 bytes/char so we’ll never get the same results if we use it.
This advice only applies when creating hashes both in .Net and SQL server and comparing them. If we’re creating hashes in .Net and only store them in a database then we don’t have to worry about this.
No comments:
Post a Comment