Thread: md5() sorting

md5() sorting

From
Karsten Hilbert
Date:
Hi,

in GNUmed (wiki.gnumed.de) we use schema hashing to detect
whether a database can safely be upgraded or used by a
client. The general procedure is this:

- generate a line-by-line representation of the database
  objects in the format "schema.table.column::data type"
  from the information catalog

- sort those lines by their md5() hash (to avoid locale
  related sorting issues - or so we thought)

- generate an md5() hash over the concatenation and
  compare that to known hashes

This has worked nicely so far. However, recently a Norwegian
user found that his setup sorts those md5() line hashes
differently from other setups. Be his setup broken or not
we'd like to avoid this issue in the future. What
immediately comes to mind is to convert the md5() hex string
to integer and sort by that (numeric sorting should, by all
means, be universally acceptable) but, alas, it's to large
even for bigint.

So, I was thinking to extract parts of the string, convert
those into ints and re-concatenate those ints into a string
and sort by that - sorting digits-only strings should be
pretty safe universally, too. However, I am not entirely
sure whether I'd be running a higher risk of collisions that
way. (Much simplified) example:

md5 = x'fe'    (I know an md5 can never really yield that value)

part 1 = x'f' = 15
part 2 = x'e' = 14

string to sort by = '1514'
    (while in reality 'fe' = 254 => '254')

I *think* I should be safe but would like to hear another
opinion. --- I just realized this is not an SQL or even
PostgreSQL related question at all so to justify my post:

 Should I be going about this sorting or hashing or detection
 business in another way entirely which can be done at the
 SQL level ?

;-)

Thanks,
Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: md5() sorting

From
Martijn van Oosterhout
Date:
On Wed, Nov 07, 2007 at 03:42:11PM +0100, Karsten Hilbert wrote:
>  Should I be going about this sorting or hashing or detection
>  business in another way entirely which can be done at the
>  SQL level ?

I'm wondering if you cast the md5sum as a bytea instead of text and
then sort, if that would solve it simply.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
>  -- John F Kennedy

Attachment

Re: md5() sorting

From
Karsten Hilbert
Date:
On Wed, Nov 07, 2007 at 03:54:02PM +0100, Martijn van Oosterhout wrote:

> >  Should I be going about this sorting or hashing or detection
> >  business in another way entirely which can be done at the
> >  SQL level ?
>
> I'm wondering if you cast the md5sum as a bytea instead of text and
> then sort, if that would solve it simply.

Along the lines of

    ... ORDER BY decode(md5('...'), 'hex');

? I knew I'd ask here ;-)

BTW, my Google Fu was lacking or I'd have found this earlier:

    http://www.varlena.com/GeneralBits/21.php
    (see middle of page)

Thanks,
Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: md5() sorting

From
"Marko Kreen"
Date:
On 11/7/07, Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:
> On Wed, Nov 07, 2007 at 03:54:02PM +0100, Martijn van Oosterhout wrote:
> > >  Should I be going about this sorting or hashing or detection
> > >  business in another way entirely which can be done at the
> > >  SQL level ?
> >
> > I'm wondering if you cast the md5sum as a bytea instead of text and
> > then sort, if that would solve it simply.
>
> Along the lines of
>
>         ... ORDER BY decode(md5('...'), 'hex');
>
> ? I knew I'd ask here ;-)

Maybe using digest(.., 'md5') function from pgcrypto would be better?
It gives bytea immidiately.

--
marko

Re: md5() sorting

From
Karsten Hilbert
Date:
On Wed, Nov 07, 2007 at 05:36:47PM +0200, Marko Kreen wrote:

> > > I'm wondering if you cast the md5sum as a bytea instead of text and
> > > then sort, if that would solve it simply.
> >
> > Along the lines of
> >
> >         ... ORDER BY decode(md5('...'), 'hex');
> >
> Maybe using digest(.., 'md5') function from pgcrypto would be better?
> It gives bytea immidiately.

Sounds better at first, yes, but requires pgcrypto to be
installed on machines onto which GNUmed is to be deployed.

Thanks for the suggestion, though.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: md5() sorting

From
Lew
Date:
Karsten Hilbert wrote:
> On Wed, Nov 07, 2007 at 05:36:47PM +0200, Marko Kreen wrote:
>
>>>> I'm wondering if you cast the md5sum as a bytea instead of text and
>>>> then sort, if that would solve it simply.
>>> Along the lines of
>>>
>>>         ... ORDER BY decode(md5('...'), 'hex');
>>>
>> Maybe using digest(.., 'md5') function from pgcrypto would be better?
>> It gives bytea immidiately.
>
> Sounds better at first, yes, but requires pgcrypto to be
> installed on machines onto which GNUmed is to be deployed.
>
> Thanks for the suggestion, though.

Since an MD5 hash is a byte array, using text (or any locale-encoded
representation) of it is a mistake.  Hashes are not strings and shouldn't be
treated as such.

--
Lew

Re: md5() sorting

From
"Scott Marlowe"
Date:
On Nov 12, 2007 1:53 PM, Lew <lew@lwsc.ehost-services.com> wrote:
> Karsten Hilbert wrote:
> > On Wed, Nov 07, 2007 at 05:36:47PM +0200, Marko Kreen wrote:
> >
> >>>> I'm wondering if you cast the md5sum as a bytea instead of text and
> >>>> then sort, if that would solve it simply.
> >>> Along the lines of
> >>>
> >>>         ... ORDER BY decode(md5('...'), 'hex');
> >>>
> >> Maybe using digest(.., 'md5') function from pgcrypto would be better?
> >> It gives bytea immidiately.
> >
> > Sounds better at first, yes, but requires pgcrypto to be
> > installed on machines onto which GNUmed is to be deployed.
> >
> > Thanks for the suggestion, though.
>
> Since an MD5 hash is a byte array, using text (or any locale-encoded
> representation) of it is a mistake.  Hashes are not strings and shouldn't be
> treated as such.

Well, the output of md5() on postgresql is not binary but a hex
representation, so it's not that big of a deal really.