Home > mailing lists

Re: How to hash a large amount of data within Postgres? - Mailing list pgsql-general

From	Tomas Vondra
Subject	Re: How to hash a large amount of data within Postgres?
Date	June 23, 2021 22:56:04
Msg-id	b9f17100-47a0-f675-1531-14803227a11e@enterprisedb.com Whole thread Raw
In response to	Re: How to hash a large amount of data within Postgres? ("Peter J. Holzer" <hjp-pgsql@hjp.at>)
Responses	Re: [SPAM] Re: How to hash a large amount of data within Postgres?
List	pgsql-general

Tree view

On 6/23/21 7:39 PM, Peter J. Holzer wrote:
> On 2021-06-21 15:53:09 +0200, Thorsten Schöning wrote:
>> Some years ago I implemented some SQL to read all files, build a table
>> of SHA256 hashes and tell me how much data is redundant. The goal was
>> to have a look at which files share the same hash with different LOIDs
>> and optionally change that, so that all those files are only stored
>> once on the end.
>>
>> While the approach was pretty naive, because it simply read all files
>> into memory to calculate the hashes, I'm somewhat sure it worked in
>> the past with Postgres 9.6. The executing server had enough free RAM
>> available as well to process the at most ~4 GiB large files one after
>> another.
>>
>> I tried that SQL today with Postgres 11 on UB 18.04 and it failed:
>>
>>> [Code: 0, SQL State: XX000]  FEHLER: invalid memory alloc request size 1898107949
>>>   Wobei: PL/pgSQL-Funktion loid_hash_calc(oid,text)[...]
> [...]
>> I searched regaridng that issue and only found two relevant results:
>> Corrupted rows for some reason and simply size restrictions when
>> allocating memory. The latter is more likely than the former in my
>> case, as the restrictions seems to be 1 GiB and I do have larger
>> files.
> 
> 1 GB is the maximum size of quite a few data structures in PostgreSQL. I
> don't use PL/pgSQL, but I wouldn't be surprised if it was the maximum
> size of whatever loread() returns (a bytea?). I would be surprised if
> this limit was higher in version 9.6 than it is in version 11, however.
> 

Well, it's actually a bit worse than that - the maximum allocation size
is (1GB - 1B), as it's defined like this:

#define MaxAllocSize    ((Size) 0x3fffffff) /* 1 gigabyte - 1 */

And this includes both the "user data" and a small "header" used for the
bytea value. Depending on what format you use to output the values there
may be additional limits (e.g. 'hex' requires 2 characters per byte, so
doubling the amount of memory needed).

For large objects this is not an issue, because we store them in small
chunks, not as one large bytea value.

> 
>> I'm doing the following simply currently, because I didn't find any
>> interfaces allowing to forward blocks of data, LOIDs, file descriptors
>> or anything like that working with smaller buffers or alike.
>>
>>> fd      := lo_open( loid,  INV_READ);
>>> size    := lo_lseek(fd, 0, SEEK_END);
>>> PERFORM    lo_lseek(fd, 0, SEEK_SET);
>>
>>> hashBin := digest(loread(fd, size), algorithm);
>>> hashHex := encode(hashBin,          'hex');
>>
>> So, is there any way to work around the problem I have currently?
> 
> Normally, hash libararies have a way to feed chunks of data into a hash
> computations to avoid having to keep the whole thing in RAM.
> The pgcrypto extension seems to be lacking such functionality, however.
> 
> I would build something similar to a Merkle tree:
> 
> Choose a convenient chunk size (a few MB is probably ok), read the large
> object in chunks of this size, computing the hash for each. Concatenate
> all the hashes and compute the hash of that. Add intermediate levels if
> the the concatenated hashes are still too large to fit in memory.
> 

Not sure where you searched, but there definitely are interfaces to read
chunks of data from large objects - see this:

1) lo_get (loid, offset, length)
   https://www.postgresql.org/docs/13/lo-funcs.html

2) lo_seek() + lo_read()
   https://www.postgresql.org/docs/13/lo-interfaces.html

Obviously, you can't do "loread(fd, size)" because that's going to
attempt building one large bytea, failing because of the alloc limit.
You have to stream the data into the hash.

Doing that in plpgsql is possible, although possibly somewhat slow.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

pgsql-general by date:

From: Alvaro Herrera
Date: 23 June 2021, 22:55:40
Subject: Re: Partitioned Table Index Column Order

From: Thomas Munro
Date: 23 June 2021, 23:37:54
Subject: Re: Is there something similar like flashback query from Oracle planned for PostgreSQL

Re: How to hash a large amount of data within Postgres? - Mailing list pgsql-general

Previous

Next