Re: MD5 aggregate - Mailing list pgsql-hackers
From | Dean Rasheed |
---|---|
Subject | Re: MD5 aggregate |
Date | |
Msg-id | CAEZATCXUUrx0QnLtbW5=ZmsAMQkCopWNwGqeR45qoOSOaLV9dw@mail.gmail.com Whole thread Raw |
In response to | Re: MD5 aggregate (Noah Misch <noah@leadboat.com>) |
Responses |
Re: MD5 aggregate
Re: MD5 aggregate |
List | pgsql-hackers |
On 26 June 2013 19:32, Noah Misch <noah@leadboat.com> wrote: > On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote: >> I've been playing around with the idea of an aggregate that computes >> the sum of the md5 hashes of each of its inputs, which I've called >> md5_total() for now, although I'm not particularly wedded to that >> name. Comparing it with md5_agg() on a 100M row table (see attached >> test script) produces interesting results: >> >> SELECT md5_agg(foo.*::text) >> FROM (SELECT * FROM foo ORDER BY id) foo; >> >> 50bc42127fb9b028c9708248f835ed8f >> >> Time: 92960.021 ms >> >> SELECT md5_total(foo.*::text) FROM foo; >> >> 02faea7fafee4d253fc94cfae031afc43c03479c >> >> Time: 96190.343 ms >> >> Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its >> result is longer) but it seems like it would be very useful for >> quickly comparing data in SQL, since its value is not dependent on the >> row-order making it easier to use and better performing if there is no >> usable index for ordering. > > I took a look at this patch. First, as Peter suggested upthread, it should > have been two patches: one to optimize calculateDigestFromBuffer() and > callees, another atop the first adding new SQL functions and adjusting the > infrastructure to meet their needs. > I didn't really set out with the aim of optimising the existing md5() functions at all, but merely to restructure the code in a way that would expose the necessary API for md5_agg(). As the timings up-thread show, the performance gain is really very modest, although the memory saving might be more of a factor in some setups. In fact, for the sorts of queries shown, the vast majority of the time is spent elsewhere, as can be seen by replacing md5() with a more trivial function such as length(). >> + <primary>md5_total</primary> >> + </indexterm> >> + <function>md5_total(<replaceable class="parameter">expression</replaceable>)</function> >> + </entry> >> + <entry><type>text</type> or <type>bytea</type></entry> >> + <entry><type>text</type></entry> >> + <entry> >> + sum of the MD5 hash values of the input values, independent of their >> + order > > This is not specific enough. We would need to provide either an algorithm > specification or a literature reference. However, that just leads to the > problem that we should not put a literature-unrecognized cryptographic > algorithm into core PostgreSQL. I realize that the use case inspiring this > patch wasn't concerned with cryptographic properties, but the association with > md5 immediately makes it a cryptography offering. > Yes, I can totally understand that point-of-view. I wasn't really intending to write md5_total() at the outset - it was more of an intellectual exercise following some of the previous points raised. So on reflection, I think I can also agree that that probably doesn't belong in core. > md5_agg() is well-defined and not cryptographically novel, and your use case > is credible. However, not every useful-sometimes function belongs in core; we > mostly stick to functions with ubiquitous value and functions that would be > onerous to implement externally. (We do carry legacy stuff that wouldn't make > the cut today.) In my estimation, md5_agg() does not make that cut. The > variety of intuitive md5_agg() definitions attested upthread doesn't bode well > for its broad applicability. It could just as well live in an extension > published on PGXN. Mine is just one opinion, though; I won't be horrified if > the community wants an md5_agg() in core. > I disagree with this though. I see md5_agg() as a natural extension of the already-in-core md5() functions and underlying code, satisfying a well-defined use-case, and providing a checksum comparable with externally generated md5 sums. A quick google search reveals several people asking for something like this, and people recommending md5(string_agg(...)) or md5(string_agg(md5(...))) based solutions, which are doomed to failure on larger tables. So I think that there is a case for having md5_agg() in core as an alternative to such hacks, while having more sophisticated crypto functions available as extensions. >> *** a/src/backend/libpq/md5.c >> --- b/src/backend/libpq/md5.c >> *************** >> *** 1,14 **** >> /* >> * md5.c >> * >> ! * Implements the MD5 Message-Digest Algorithm as specified in >> ! * RFC 1321. This implementation is a simple one, in that it >> ! * needs every input byte to be buffered before doing any >> ! * calculations. I do not expect this file to be used for >> ! * general purpose MD5'ing of large amounts of data, only for >> ! * generating hashed passwords from limited input. > > In other words, performance wasn't a design criterion. I wonder if we would > be wasting our time with a narrow performance change like removing the data > copy. What's your opinion of copying pgcrypto's md5 implementation, which > already avoids the data copy? > I haven't looked at pgcrypto because, as I said, performance wasn't my primary criterion either, but removing the unnessary data copy just seemed like good sense. I'll take a look at it, and then as you and Peter suggest, look to split the patch into two. I think I will withdraw md5_total() because I was never entirely happy with that anyway. Thanks for the review. Regards, Dean
pgsql-hackers by date: