Re: MD5 aggregate - Mailing list pgsql-hackers

From Dean Rasheed
Subject Re: MD5 aggregate
Date
Msg-id CAEZATCXUUrx0QnLtbW5=ZmsAMQkCopWNwGqeR45qoOSOaLV9dw@mail.gmail.com
Whole thread Raw
In response to Re: MD5 aggregate  (Noah Misch <noah@leadboat.com>)
Responses Re: MD5 aggregate
Re: MD5 aggregate
List pgsql-hackers
On 26 June 2013 19:32, Noah Misch <noah@leadboat.com> wrote:
> On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
>> I've been playing around with the idea of an aggregate that computes
>> the sum of the md5 hashes of each of its inputs, which I've called
>> md5_total() for now, although I'm not particularly wedded to that
>> name. Comparing it with md5_agg() on a 100M row table (see attached
>> test script) produces interesting results:
>>
>> SELECT md5_agg(foo.*::text)
>>   FROM (SELECT * FROM foo ORDER BY id) foo;
>>
>>  50bc42127fb9b028c9708248f835ed8f
>>
>> Time: 92960.021 ms
>>
>> SELECT md5_total(foo.*::text) FROM foo;
>>
>>  02faea7fafee4d253fc94cfae031afc43c03479c
>>
>> Time: 96190.343 ms
>>
>> Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
>> result is longer) but it seems like it would be very useful for
>> quickly comparing data in SQL, since its value is not dependent on the
>> row-order making it easier to use and better performing if there is no
>> usable index for ordering.
>
> I took a look at this patch.  First, as Peter suggested upthread, it should
> have been two patches: one to optimize calculateDigestFromBuffer() and
> callees, another atop the first adding new SQL functions and adjusting the
> infrastructure to meet their needs.
>

I didn't really set out with the aim of optimising the existing md5()
functions at all, but merely to restructure the code in a way that
would expose the necessary API for md5_agg(). As the timings up-thread
show, the performance gain is really very modest, although the memory
saving might be more of a factor in some setups. In fact, for the
sorts of queries shown, the vast majority of the time is spent
elsewhere, as can be seen by replacing md5() with a more trivial
function such as length().


>> +         <primary>md5_total</primary>
>> +        </indexterm>
>> +        <function>md5_total(<replaceable class="parameter">expression</replaceable>)</function>
>> +       </entry>
>> +       <entry><type>text</type> or <type>bytea</type></entry>
>> +       <entry><type>text</type></entry>
>> +       <entry>
>> +        sum of the MD5 hash values of the input values, independent of their
>> +        order
>
> This is not specific enough.  We would need to provide either an algorithm
> specification or a literature reference.  However, that just leads to the
> problem that we should not put a literature-unrecognized cryptographic
> algorithm into core PostgreSQL.  I realize that the use case inspiring this
> patch wasn't concerned with cryptographic properties, but the association with
> md5 immediately makes it a cryptography offering.
>

Yes, I can totally understand that point-of-view. I wasn't really
intending to write md5_total() at the outset - it was more of an
intellectual exercise following some of the previous points raised. So
on reflection, I think I can also agree that that probably doesn't
belong in core.


> md5_agg() is well-defined and not cryptographically novel, and your use case
> is credible.  However, not every useful-sometimes function belongs in core; we
> mostly stick to functions with ubiquitous value and functions that would be
> onerous to implement externally.  (We do carry legacy stuff that wouldn't make
> the cut today.)  In my estimation, md5_agg() does not make that cut.  The
> variety of intuitive md5_agg() definitions attested upthread doesn't bode well
> for its broad applicability.  It could just as well live in an extension
> published on PGXN.  Mine is just one opinion, though; I won't be horrified if
> the community wants an md5_agg() in core.
>

I disagree with this though. I see md5_agg() as a natural extension of
the already-in-core md5() functions and underlying code, satisfying a
well-defined use-case, and providing a checksum comparable with
externally generated md5 sums.

A quick google search reveals several people asking for something like
this, and people recommending md5(string_agg(...)) or
md5(string_agg(md5(...))) based solutions, which are doomed to failure
on larger tables. So I think that there is a case for having md5_agg()
in core as an alternative to such hacks, while having more
sophisticated crypto functions available as extensions.


>> *** a/src/backend/libpq/md5.c
>> --- b/src/backend/libpq/md5.c
>> ***************
>> *** 1,14 ****
>>   /*
>>    *  md5.c
>>    *
>> !  *  Implements      the  MD5 Message-Digest Algorithm as specified in
>> !  *  RFC  1321.      This  implementation  is a simple one, in that it
>> !  *  needs  every  input  byte  to  be  buffered  before doing any
>> !  *  calculations.  I  do  not  expect  this  file  to be used for
>> !  *  general  purpose  MD5'ing  of large amounts of data, only for
>> !  *  generating hashed passwords from limited input.
>
> In other words, performance wasn't a design criterion.  I wonder if we would
> be wasting our time with a narrow performance change like removing the data
> copy.  What's your opinion of copying pgcrypto's md5 implementation, which
> already avoids the data copy?
>

I haven't looked at pgcrypto because, as I said, performance wasn't my
primary criterion either, but removing the unnessary data copy just
seemed like good sense.

I'll take a look at it, and then as you and Peter suggest, look to
split the patch into two. I think I will withdraw md5_total() because
I was never entirely happy with that anyway.

Thanks for the review.

Regards,
Dean



pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: [PATCH] Fix conversion for Decimal arguments in plpython functions
Next
From: Szymon Guz
Date:
Subject: Re: [PATCH] Fix conversion for Decimal arguments in plpython functions