Re: How to hash a large amount of data within Postgres? - Mailing list pgsql-general

From Tomas Vondra
Subject Re: How to hash a large amount of data within Postgres?
Date
Msg-id b9f17100-47a0-f675-1531-14803227a11e@enterprisedb.com
Whole thread Raw
In response to Re: How to hash a large amount of data within Postgres?  ("Peter J. Holzer" <hjp-pgsql@hjp.at>)
Responses Re: [SPAM] Re: How to hash a large amount of data within Postgres?  (Thorsten Schöning<tschoening@am-soft.de>)
List pgsql-general
On 6/23/21 7:39 PM, Peter J. Holzer wrote:
> On 2021-06-21 15:53:09 +0200, Thorsten Schöning wrote:
>> Some years ago I implemented some SQL to read all files, build a table
>> of SHA256 hashes and tell me how much data is redundant. The goal was
>> to have a look at which files share the same hash with different LOIDs
>> and optionally change that, so that all those files are only stored
>> once on the end.
>>
>> While the approach was pretty naive, because it simply read all files
>> into memory to calculate the hashes, I'm somewhat sure it worked in
>> the past with Postgres 9.6. The executing server had enough free RAM
>> available as well to process the at most ~4 GiB large files one after
>> another.
>>
>> I tried that SQL today with Postgres 11 on UB 18.04 and it failed:
>>
>>> [Code: 0, SQL State: XX000]  FEHLER: invalid memory alloc request size 1898107949
>>>   Wobei: PL/pgSQL-Funktion loid_hash_calc(oid,text)[...]
> [...]
>> I searched regaridng that issue and only found two relevant results:
>> Corrupted rows for some reason and simply size restrictions when
>> allocating memory. The latter is more likely than the former in my
>> case, as the restrictions seems to be 1 GiB and I do have larger
>> files.
> 
> 1 GB is the maximum size of quite a few data structures in PostgreSQL. I
> don't use PL/pgSQL, but I wouldn't be surprised if it was the maximum
> size of whatever loread() returns (a bytea?). I would be surprised if
> this limit was higher in version 9.6 than it is in version 11, however.
> 

Well, it's actually a bit worse than that - the maximum allocation size
is (1GB - 1B), as it's defined like this:

#define MaxAllocSize    ((Size) 0x3fffffff) /* 1 gigabyte - 1 */

And this includes both the "user data" and a small "header" used for the
bytea value. Depending on what format you use to output the values there
may be additional limits (e.g. 'hex' requires 2 characters per byte, so
doubling the amount of memory needed).

For large objects this is not an issue, because we store them in small
chunks, not as one large bytea value.

> 
>> I'm doing the following simply currently, because I didn't find any
>> interfaces allowing to forward blocks of data, LOIDs, file descriptors
>> or anything like that working with smaller buffers or alike.
>>
>>> fd      := lo_open( loid,  INV_READ);
>>> size    := lo_lseek(fd, 0, SEEK_END);
>>> PERFORM    lo_lseek(fd, 0, SEEK_SET);
>>
>>> hashBin := digest(loread(fd, size), algorithm);
>>> hashHex := encode(hashBin,          'hex');
>>
>> So, is there any way to work around the problem I have currently?
> 
> Normally, hash libararies have a way to feed chunks of data into a hash
> computations to avoid having to keep the whole thing in RAM.
> The pgcrypto extension seems to be lacking such functionality, however.
> 
> I would build something similar to a Merkle tree:
> 
> Choose a convenient chunk size (a few MB is probably ok), read the large
> object in chunks of this size, computing the hash for each. Concatenate
> all the hashes and compute the hash of that. Add intermediate levels if
> the the concatenated hashes are still too large to fit in memory.
> 

Not sure where you searched, but there definitely are interfaces to read
chunks of data from large objects - see this:

1) lo_get (loid, offset, length)
   https://www.postgresql.org/docs/13/lo-funcs.html

2) lo_seek() + lo_read()
   https://www.postgresql.org/docs/13/lo-interfaces.html

Obviously, you can't do "loread(fd, size)" because that's going to
attempt building one large bytea, failing because of the alloc limit.
You have to stream the data into the hash.

Doing that in plpgsql is possible, although possibly somewhat slow.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-general by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Partitioned Table Index Column Order
Next
From: Thomas Munro
Date:
Subject: Re: Is there something similar like flashback query from Oracle planned for PostgreSQL