Thread: aggregate hash function

From: "Matthew Dennis"
I need an aggregate hash function, something like "select md5_agg(someTextColumn) from (select someTextColumn from someTable order by someOrderingColumn) as t". I know there is an existing md5() function, but it is not an aggregate. I have thought about writing a "concat" aggregate that concatenates the input into one long string and then calling md5() on the result, but that seems likely to perform badly (memory consumption, possibly spilling to disk, many large memory copies, etc.), because it would build up the entire concatenated string before hashing it.

I also thought about writing an aggregate function that keeps the MD5 result as a string in its state, concatenates each new input onto the current state, hashes that, and uses the result as the new state. This avoids building up a giant string only to traverse it once at the end to get the MD5 sum. This approach would actually work for me, but it doesn't give me the actual MD5 sum of the data, which is what I really want.
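
A minimal Python sketch of this chained-state idea, for illustration only. The function names (chained_sfunc, chained_md5_agg) and the empty-string initial state are assumptions, not anything PostgreSQL provides. It shows that the state stays small, but the final value is not the MD5 of the concatenated input.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a 32-character hex string."""
    return hashlib.md5(data).hexdigest()


def chained_sfunc(state: str, value: str) -> str:
    """State transition: hash the previous digest concatenated with the new input."""
    return md5_hex((state + value).encode("utf-8"))


def chained_md5_agg(values) -> str:
    """Fold the rows, in order, through chained_sfunc."""
    state = ""  # assumed initial state: empty string
    for value in values:
        state = chained_sfunc(state, value)
    return state


rows = ["alpha", "beta", "gamma"]
chained = chained_md5_agg(rows)
true_md5 = md5_hex("".join(rows).encode("utf-8"))

# The state is always a 32-character digest, so memory use is constant,
# but the result is a hash of hashes rather than the MD5 of the data.
print(chained)
print(true_md5)
```

The result still depends on every row and on row order, which is why it can serve as a fingerprint, but it cannot be compared against an MD5 computed over the same data by another tool.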

comments/ideas/suggestions?

Re: aggregate hash function

From: "Vyacheslav Kalinin"
Most implementations of MD5 internally consist of three functions: md5_init, which initializes the internal context; md5_update, which accepts portions of data and processes them; and md5_final, which finalizes the hash and releases the context. These map roughly onto an aggregate's internal functions (SFUNC and FINALFUNC; md5_init would probably be called on the first actual input). Since performance is important to you, the functions should be written in a low-level language such as C. To me it doesn't look difficult to take some C MD5 module and adapt it to be an aggregate... though it's not like I would do this easily myself :)
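
The mapping described above can be sketched in Python, whose hashlib module follows the same init/update/final pattern. This only illustrates the shape of the aggregate and is not the C implementation suggested here; md5_sfunc, md5_finalfunc and md5_agg are hypothetical names, and returning None for empty input is an assumption modelled on SQL NULL.

```python
import hashlib


def md5_sfunc(state, value: str):
    """SFUNC analogue: create the context on the first input, then feed it data."""
    if state is None:
        state = hashlib.md5()            # plays the role of md5_init
    state.update(value.encode("utf-8"))  # plays the role of md5_update
    return state


def md5_finalfunc(state):
    """FINALFUNC analogue: return the hex digest, or None if no rows were seen."""
    if state is None:
        return None
    return state.hexdigest()             # plays the role of md5_final


def md5_agg(values):
    """Drive the two functions the way an aggregate would, one row at a time."""
    state = None
    for value in values:
        state = md5_sfunc(state, value)
    return md5_finalfunc(state)


rows = ["alpha", "beta", "gamma"]
incremental = md5_agg(rows)
concatenated = hashlib.md5("".join(rows).encode("utf-8")).hexdigest()
print(incremental == concatenated)
```

Because the hash context only holds a fixed-size internal state, the input is never concatenated in memory, yet the digest is the same as hashing the concatenated string.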



Re: aggregate hash function

From: "Matthew Dennis"
On Jan 30, 2008 4:40 PM, Vyacheslav Kalinin <vka@mgcp.com> wrote:
> Most implementations of MD5 internally consist of three functions: md5_init, which initializes the internal context; md5_update, which accepts portions of data and processes them; and md5_final, which finalizes the hash and releases the context. These map roughly onto an aggregate's internal functions (SFUNC and FINALFUNC; md5_init would probably be called on the first actual input). Since performance is important to you, the functions should be written in a low-level language such as C. To me it doesn't look difficult to take some C MD5 module and adapt it to be an aggregate... though it's not like I would do this easily myself :)

Yes, thank you, I'm aware of how MD5 works - that's precisely why I don't like the idea of concatenating everything together first. I was hoping that, because PG already exposes an MD5 function, it used a standard library and also exposed the constituent functions, and that I just wasn't looking in the right place for them. If it did, it would be pretty trivial to use them as the SFUNC and FINALFUNC when creating an aggregate.

Thanks for the help.