Thread: md5() sorting
Hi, in GNUmed (wiki.gnumed.de) we use schema hashing to detect whether a database can safely be upgraded or used by a client. The general procedure is this: - generate a line-by-line representation of the database objects in the format "schema.table.column::data type" from the information catalog - sort those lines by their md5() hash (to avoid locale related sorting issues - or so we thought) - generate an md5() hash over the concatenation and compare that to known hashes This has worked nicely so far. However, recently a Norwegian user found that his setup sorts those md5() line hashes differently from other setups. Be his setup broken or not we'd like to avoid this issue in the future. What immediately comes to mind is to convert the md5() hex string to integer and sort by that (numeric sorting should, by all means, be universally acceptable) but, alas, it's to large even for bigint. So, I was thinking to extract parts of the string, convert those into ints and re-concatenate those ints into a string and sort by that - sorting digits-only strings should be pretty safe universally, too. However, I am not entirely sure whether I'd be running a higher risk of collisions that way. (Much simplified) example: md5 = x'fe' (I know an md5 can never really yield that value) part 1 = x'f' = 15 part 2 = x'e' = 14 string to sort by = '1514' (while in reality 'fe' = 254 => '254') I *think* I should be safe but would like to hear another opinion. --- I just realized this is not an SQL or even PostgreSQL related question at all so to justify my post: Should I be going about this sorting or hashing or detection business in another way entirely which can be done at the SQL level ? ;-) Thanks, Karsten -- GPG key ID E4071346 @ wwwkeys.pgp.net E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
On Wed, Nov 07, 2007 at 03:42:11PM +0100, Karsten Hilbert wrote: > Should I be going about this sorting or hashing or detection > business in another way entirely which can be done at the > SQL level ? I'm wondering if you cast the md5sum as a bytea instead of text and then sort, if that would solve it simply. Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Those who make peaceful revolution impossible will make violent revolution inevitable. > -- John F Kennedy
Attachment
On Wed, Nov 07, 2007 at 03:54:02PM +0100, Martijn van Oosterhout wrote: > > Should I be going about this sorting or hashing or detection > > business in another way entirely which can be done at the > > SQL level ? > > I'm wondering if you cast the md5sum as a bytea instead of text and > then sort, if that would solve it simply. Along the lines of ... ORDER BY decode(md5('...'), 'hex'); ? I knew I'd ask here ;-) BTW, my Google Fu was lacking or I'd have found this earlier: http://www.varlena.com/GeneralBits/21.php (see middle of page) Thanks, Karsten -- GPG key ID E4071346 @ wwwkeys.pgp.net E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
On 11/7/07, Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote: > On Wed, Nov 07, 2007 at 03:54:02PM +0100, Martijn van Oosterhout wrote: > > > Should I be going about this sorting or hashing or detection > > > business in another way entirely which can be done at the > > > SQL level ? > > > > I'm wondering if you cast the md5sum as a bytea instead of text and > > then sort, if that would solve it simply. > > Along the lines of > > ... ORDER BY decode(md5('...'), 'hex'); > > ? I knew I'd ask here ;-) Maybe using digest(.., 'md5') function from pgcrypto would be better? It gives bytea immidiately. -- marko
On Wed, Nov 07, 2007 at 05:36:47PM +0200, Marko Kreen wrote: > > > I'm wondering if you cast the md5sum as a bytea instead of text and > > > then sort, if that would solve it simply. > > > > Along the lines of > > > > ... ORDER BY decode(md5('...'), 'hex'); > > > Maybe using digest(.., 'md5') function from pgcrypto would be better? > It gives bytea immidiately. Sounds better at first, yes, but requires pgcrypto to be installed on machines onto which GNUmed is to be deployed. Thanks for the suggestion, though. Karsten -- GPG key ID E4071346 @ wwwkeys.pgp.net E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
Karsten Hilbert wrote: > On Wed, Nov 07, 2007 at 05:36:47PM +0200, Marko Kreen wrote: > >>>> I'm wondering if you cast the md5sum as a bytea instead of text and >>>> then sort, if that would solve it simply. >>> Along the lines of >>> >>> ... ORDER BY decode(md5('...'), 'hex'); >>> >> Maybe using digest(.., 'md5') function from pgcrypto would be better? >> It gives bytea immidiately. > > Sounds better at first, yes, but requires pgcrypto to be > installed on machines onto which GNUmed is to be deployed. > > Thanks for the suggestion, though. Since an MD5 hash is a byte array, using text (or any locale-encoded representation) of it is a mistake. Hashes are not strings and shouldn't be treated as such. -- Lew
On Nov 12, 2007 1:53 PM, Lew <lew@lwsc.ehost-services.com> wrote: > Karsten Hilbert wrote: > > On Wed, Nov 07, 2007 at 05:36:47PM +0200, Marko Kreen wrote: > > > >>>> I'm wondering if you cast the md5sum as a bytea instead of text and > >>>> then sort, if that would solve it simply. > >>> Along the lines of > >>> > >>> ... ORDER BY decode(md5('...'), 'hex'); > >>> > >> Maybe using digest(.., 'md5') function from pgcrypto would be better? > >> It gives bytea immidiately. > > > > Sounds better at first, yes, but requires pgcrypto to be > > installed on machines onto which GNUmed is to be deployed. > > > > Thanks for the suggestion, though. > > Since an MD5 hash is a byte array, using text (or any locale-encoded > representation) of it is a mistake. Hashes are not strings and shouldn't be > treated as such. Well, the output of md5() on postgresql is not binary but a hex representation, so it's not that big of a deal really.