On Thu, Jul 10, 2025 at 9:03 PM Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, Jul 10, 2025 at 5:48 AM Dominique Devienne <ddevienne@gmail.com> wrote:
>> We store scientific information in PostgreSQL, and some of that is
>> bytea and large, thus we must "chunk it", both for performance and
>> to not be limited to 1GB (which we do exceed, on rare occasions).
>
> Does it have to match the md5 of the 'unchunked' variant exactly?
Well, yeah, ideally.
> If not, maybe you can rig a custom aggregate that would just "hash amend"
Sure, there are various ways to get a deterministic hash of such a
chunked-across-rows (logical) value. But they all involve hashing the
hashes, which statistically might not be ideal. More importantly, it
departs from the normal "serial" hashing of the full logical value.
With the full hash, I can use various OS / filesystem tools, SQLite
aggregate UDFs, and (TBD) PostgreSQL aggregate digests to compare
those hashes in natural and consistent ways. Working around the lack
of aggregate digests in PostgreSQL forces me to replicate those
work-arounds at the filesystem and SQLite levels for comparisons.
I.e. not good.
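
For illustration, the kind of "hash of hashes" I mean could look like
the sketch below, assuming a hypothetical chunk table
blob_chunks(blob_id, chunk_no, data bytea) and the pgcrypto extension
for digest(). It is deterministic, but it is NOT the digest of the
unchunked value:

  -- deterministic, but a digest of per-chunk digests, not of the
  -- re-assembled value
  SELECT digest(
           string_agg(digest(data, 'sha256'), ''::bytea
                      ORDER BY chunk_no),
           'sha256') AS hash_of_hashes
    FROM blob_chunks
   WHERE blob_id = 1;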
> I had to do something very similar with compression, I had a need to be
> able to compress bytea values with lz4 at SQL level due to limited support
> for extensions in the cloud. it works great... if a bit slow, and requires
> the database to handle the decompression.
The slow part is what I want to avoid. I explored various ways to
work around the lack of true server-side aggregate hashing, and none
is satisfactory or performant enough. This is really the kind of
primitive that must be built in to be efficient enough to be "useful".
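
One such workaround, for example, still assuming the hypothetical
blob_chunks table above, is to re-assemble the value server-side and
hash it in one pass. It does yield the md5 of the unchunked value, but
it copies the whole thing in memory and is again subject to the 1GB
bytea limit:

  -- correct result, but re-materializes the full value server-side
  SELECT md5(string_agg(data, ''::bytea ORDER BY chunk_no)) AS full_md5
    FROM blob_chunks
   WHERE blob_id = 1;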
(And I'd argue compression, deduplication, etc. are similar issues.
BLOBs / BYTEAs are too often "limited" in RDBMSs, with people saying
to store them outside the DB, which I always find strange, and a bit
of a cop-out about making them performant enough and/or full-featured
enough. A bit of a rant here, sorry ;)).