Home > mailing lists

Re: Aggregate versions of hashing functions (md5, sha1, etc...) - Mailing list pgsql-general

From	Merlin Moncure
Subject	Re: Aggregate versions of hashing functions (md5, sha1, etc...)
Date	July 10, 2025 22:02:59
Msg-id	CAHyXU0ztigAau2x5zut6tdjNVXENEGR00F-Mvx4X-20zFz-NnQ@mail.gmail.com Whole thread
In response to	Aggregate versions of hashing functions (md5, sha1, etc...) (Dominique Devienne <ddevienne@gmail.com>)
Responses	Re: Aggregate versions of hashing functions (md5, sha1, etc...)
List	pgsql-general

Tree view

On Thu, Jul 10, 2025 at 5:48 AM Dominique Devienne <ddevienne@gmail.com> wrote:

We store scientific information in PostgreSQL, and some of that is
bytea and large, thus we must "chunk it" both for performance, and not
be limited to 1GB (we do exceed that, in rare occasions).

Recently I added md5/sha1 hashing support for such values (for various
reasons, to track corruptions in our ETLs, now fixed, but also in the
future for custom smart sync and such), and was surprised to discover
there are no aggregate versions of those functions, neither the
built-in md5 one (now with bytea overload), nor for pgcrypto. Did I
miss something?

Any chance this might be added in the future?

Seems so logical to me, that these hashing functions were available
are aggregates, I can't be the first one to think of that, can it?

Thanks, --DD

PS: The holly-grail IMHO, would be built-in support for hashing, with
intelligent lazy-compute and persistence correctly reset on changes.
Probably opt-in.

PPS: Built-in Support for xxh64, or at least in official pgcrypto,
would also be nice. With aggregate!

PPPS: I remember Oracle LOBs supporting a built-in implicit mime-type
attribute on them, which was used as the content-type with served over
the built-in WebDAV extension for the DB. I wish PostgreSQL had
something like that too.

PPPPS: the lo extension is not viable for us, FWIW. All the above
should be opt-in on bytea columns IMHO.

Does it have to match the md5 of the 'unchunked' variant exactly? If not, maybe you can rig a custom aggregate that would just "hash amend" the chunks using the n-1 chunk has as salt, this would be fast and easy, at the cost of not matching the unchunked variant.

I had to do something very similar with compression, I had a need to be able to compress bytea values with lz4 at SQL level due to limited support for extensions in the cloud. it works great...if a bit slow, and requires the database to handle the decompression.

merlin

pgsql-general by date:

From: "David G. Johnston"
Date: 10 July 2025, 21:04:35
Subject: Re: having temp_tablespaces on less reliable storage

From: Greg Hennessy
Date: 10 July 2025, 22:39:12
Subject: optimizing number of workers

Re: Aggregate versions of hashing functions (md5, sha1, etc...) - Mailing list pgsql-general

Previous

Next