Thread: How to hash a large amount of data within Postgres?
Hi all, I'm one of those people still storing user-uploaded files within Postgres instead of some file system, and over the years this has grown to the point where individual uploads reach multiple GiB. Some years ago I implemented some SQL to read all files, build a table of SHA256 hashes and tell me how much data is redundant. The goal was to see which files share the same hash under different LOIDs and optionally change that, so that each of those files is only stored once in the end.

While the approach was pretty naive, because it simply read each file into memory to calculate its hash, I'm fairly sure it worked in the past with Postgres 9.6. The executing server also had enough free RAM available to process the at most ~4 GiB files one after another. I tried that SQL today with Postgres 11 on Ubuntu 18.04 and it failed:

> [Code: 0, SQL State: XX000] ERROR: invalid memory alloc request size 1898107949
> CONTEXT: PL/pgSQL function loid_hash_calc(oid,text)[...]
> PostgreSQL > 11.12 (Ubuntu 11.12-1.pgdg18.04+1)
> PostgreSQL JDBC Driver > 42.2.9

I searched regarding that issue and only found two relevant results: corrupted rows for some reason, and plain size restrictions when allocating memory. The latter is more likely than the former in my case, as the restriction seems to be 1 GiB and I do have larger files.

Currently I'm simply doing the following, because I didn't find any interfaces that let me forward blocks of data, LOIDs, file descriptors or anything like that to work with smaller buffers:

> fd := lo_open( loid, INV_READ);
> size := lo_lseek(fd, 0, SEEK_END);
> PERFORM lo_lseek(fd, 0, SEEK_SET);
> hashBin := digest(loread(fd, size), algorithm);
> hashHex := encode(hashBin, 'hex');

So, is there any way to work around the problem? Can I increase the memory restriction somewhere in the config? Are there any functions available that work with blocks of data which I'm missing?
I didn't find any hash calls that maintain state. Thanks!

Kind regards
Thorsten Schöning

--
AM-SoFT IT-Service - Bitstore Hameln GmbH
Member of the Bitstore Group - your full-service provider for IT and telecommunications
E-Mail: Thorsten.Schoening@AM-SoFT.de
Web: http://www.AM-SoFT.de/
AM-SoFT IT-Service - Bitstore Hameln GmbH, Brandenburger Str. 7c, 31789 Hameln
AG Hannover HRB 221853 - Geschäftsführer: Janine Galonska
On 2021-06-21 15:53:09 +0200, Thorsten Schöning wrote:
> Some years ago I implemented some SQL to read all files, build a table
> of SHA256 hashes and tell me how much data is redundant. The goal was
> to see which files share the same hash under different LOIDs and
> optionally change that, so that each of those files is only stored
> once in the end.
>
> While the approach was pretty naive, because it simply read each file
> into memory to calculate its hash, I'm fairly sure it worked in the
> past with Postgres 9.6. The executing server also had enough free RAM
> available to process the at most ~4 GiB files one after another.
>
> I tried that SQL today with Postgres 11 on Ubuntu 18.04 and it failed:
>
> > [Code: 0, SQL State: XX000] ERROR: invalid memory alloc request size 1898107949
> > CONTEXT: PL/pgSQL function loid_hash_calc(oid,text)[...]
[...]
> I searched regarding that issue and only found two relevant results:
> corrupted rows for some reason, and plain size restrictions when
> allocating memory. The latter is more likely than the former in my
> case, as the restriction seems to be 1 GiB and I do have larger files.

1 GB is the maximum size of quite a few data structures in PostgreSQL. I
don't use PL/pgSQL, but I wouldn't be surprised if it was the maximum
size of whatever loread() returns (a bytea?). I would be surprised if
this limit was higher in version 9.6 than it is in version 11, however.

> Currently I'm simply doing the following, because I didn't find any
> interfaces that let me forward blocks of data, LOIDs, file descriptors
> or anything like that to work with smaller buffers:
>
> > fd := lo_open( loid, INV_READ);
> > size := lo_lseek(fd, 0, SEEK_END);
> > PERFORM lo_lseek(fd, 0, SEEK_SET);
> > hashBin := digest(loread(fd, size), algorithm);
> > hashHex := encode(hashBin, 'hex');
>
> So, is there any way to work around the problem I have currently?
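The numbers line up with that explanation: the failed request is well above a 1 GB ceiling. A quick sanity check (Python used purely for illustration):

```python
# Compare the alloc request from the error message with a 1 GiB ceiling.
one_gib = 1 << 30            # 1073741824 bytes
failed_request = 1898107949  # size reported in the "invalid memory alloc" error

# The request is roughly 1.77 GiB -- clearly beyond a ~1 GB allocation limit.
assert failed_request > one_gib
ratio = failed_request / one_gib
print(round(ratio, 2))
```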
Normally, hash libraries have a way to feed chunks of data into a hash
computation to avoid having to keep the whole thing in RAM. The pgcrypto
extension seems to be lacking such functionality, however.

I would build something similar to a Merkle tree:

Choose a convenient chunk size (a few MB is probably ok), read the large
object in chunks of this size, computing the hash for each. Concatenate
all the hashes and compute the hash of that. Add intermediate levels if
the concatenated hashes are still too large to fit in memory.

hp

--
Peter J. Holzer | hjp@hjp.at | http://www.hjp.at/
"Story must make more sense than reality."
    -- Charles Stross, "Creative writing challenge!"
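The chunk-and-rehash scheme described above can be prototyped outside the database first. A minimal sketch in Python, with hashlib standing in for pgcrypto's digest(); `read_chunk`, `CHUNK_SIZE` and the in-memory stream are hypothetical stand-ins for a lo_read() loop:

```python
import hashlib
import io

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB: arbitrary, "a few MB is probably ok"

def chunked_fingerprint(read_chunk, algorithm="sha256"):
    """Hash-of-hashes: hash each chunk, then hash the concatenated digests.

    read_chunk() must return the next chunk as bytes, or b"" at EOF --
    in practice a wrapper around a large-object read loop.
    """
    digests = b""
    while True:
        chunk = read_chunk()
        if not chunk:
            break
        digests += hashlib.new(algorithm, chunk).digest()
    # One level suffices here: 32 bytes of digest per 4 MiB chunk keeps
    # the concatenation tiny even for multi-GiB objects (a 4 GiB object
    # yields only ~32 KiB of digests).
    return hashlib.new(algorithm, digests).hexdigest()

# Demo with an in-memory stand-in for a large object:
data = io.BytesIO(b"x" * (10 * 1024 * 1024))
fp = chunked_fingerprint(lambda: data.read(CHUNK_SIZE))
```

Note that this fingerprint is not the plain SHA-256 of the whole file, so an existing hash table would need recomputing; it is deterministic, though, so equal files still map to equal fingerprints, which is all deduplication needs.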
On 6/23/21 7:39 PM, Peter J. Holzer wrote:
> On 2021-06-21 15:53:09 +0200, Thorsten Schöning wrote:
>> Some years ago I implemented some SQL to read all files, build a table
>> of SHA256 hashes and tell me how much data is redundant. The goal was
>> to see which files share the same hash under different LOIDs and
>> optionally change that, so that each of those files is only stored
>> once in the end.
>>
>> While the approach was pretty naive, because it simply read each file
>> into memory to calculate its hash, I'm fairly sure it worked in the
>> past with Postgres 9.6. The executing server also had enough free RAM
>> available to process the at most ~4 GiB files one after another.
>>
>> I tried that SQL today with Postgres 11 on Ubuntu 18.04 and it failed:
>>
>>> [Code: 0, SQL State: XX000] ERROR: invalid memory alloc request size 1898107949
>>> CONTEXT: PL/pgSQL function loid_hash_calc(oid,text)[...]
> [...]
>> I searched regarding that issue and only found two relevant results:
>> corrupted rows for some reason, and plain size restrictions when
>> allocating memory. The latter is more likely than the former in my
>> case, as the restriction seems to be 1 GiB and I do have larger files.
>
> 1 GB is the maximum size of quite a few data structures in PostgreSQL. I
> don't use PL/pgSQL, but I wouldn't be surprised if it was the maximum
> size of whatever loread() returns (a bytea?). I would be surprised if
> this limit was higher in version 9.6 than it is in version 11, however.

Well, it's actually a bit worse than that - the maximum allocation size
is (1GB - 1B), as it's defined like this:

    #define MaxAllocSize ((Size) 0x3fffffff) /* 1 gigabyte - 1 */

And this includes both the "user data" and a small "header" used for the
bytea value. Depending on what format you use to output the values there
may be additional limits (e.g. 'hex' requires 2 characters per byte,
doubling the amount of memory needed).
For large objects this is not an issue, because we store them in small
chunks, not as one large bytea value.

>> Currently I'm simply doing the following, because I didn't find any
>> interfaces that let me forward blocks of data, LOIDs, file descriptors
>> or anything like that to work with smaller buffers:
>>
>>> fd := lo_open( loid, INV_READ);
>>> size := lo_lseek(fd, 0, SEEK_END);
>>> PERFORM lo_lseek(fd, 0, SEEK_SET);
>>> hashBin := digest(loread(fd, size), algorithm);
>>> hashHex := encode(hashBin, 'hex');
>>
>> So, is there any way to work around the problem I have currently?
>
> Normally, hash libraries have a way to feed chunks of data into a hash
> computation to avoid having to keep the whole thing in RAM. The
> pgcrypto extension seems to be lacking such functionality, however.
>
> I would build something similar to a Merkle tree:
>
> Choose a convenient chunk size (a few MB is probably ok), read the
> large object in chunks of this size, computing the hash for each.
> Concatenate all the hashes and compute the hash of that. Add
> intermediate levels if the concatenated hashes are still too large to
> fit in memory.

Not sure where you searched, but there definitely are interfaces to read
chunks of data from large objects - see these:

1) lo_get(loid, offset, length)
   https://www.postgresql.org/docs/13/lo-funcs.html

2) lo_seek() + lo_read()
   https://www.postgresql.org/docs/13/lo-interfaces.html

Obviously, you can't do "loread(fd, size)" because that's going to
attempt building one large bytea and fail because of the alloc limit.
You have to stream the data into the hash. Doing that in plpgsql is
possible, although possibly somewhat slow.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
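Since pgcrypto's digest() offers no incremental (init/update/final) variant, streaming the data into the hash is easiest wherever a stateful hash API exists, for instance client-side. A sketch in Python's hashlib under that assumption; the in-memory blob is a stand-in, and in a real run `read` would wrap a driver's large-object read handle (the function name and chunk size here are illustrative, not from the thread):

```python
import hashlib
import io

def stream_digest(read, algorithm="sha256", chunk_size=1024 * 1024):
    """Feed a large object into a hash in chunks, never holding it all in RAM.

    `read(n)` returns up to n bytes, or b"" at EOF -- analogous to a
    lo_open()/lo_read() loop over a Postgres large object.
    """
    h = hashlib.new(algorithm)
    while True:
        chunk = read(chunk_size)
        if not chunk:
            break
        h.update(chunk)  # incremental update: the stateful API pgcrypto lacks
    return h.hexdigest()

# Demo with an in-memory stand-in for the large object:
blob = io.BytesIO(b"a" * (3 * 1024 * 1024 + 17))
digest_hex = stream_digest(blob.read)
```

Unlike a hash-of-chunk-hashes, this incremental digest equals the plain SHA-256 of the whole file, so hashes computed the old way stay comparable.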
Hello Tomas Vondra,

on Thursday, 24 June 2021 at 00:56 you wrote:

> Not sure where you searched, but there definitely are interfaces to read
> chunks of data from large objects - see these:

That wasn't the point; what's missing are hash functions that work on
such read blocks of larger data while maintaining state internally. I
only see functions outputting a calculated hash for a given block of
data, not start/progress/end interfaces like in many other libs.

So, am I missing something, or do I need to build something based on
multiple individually output hash results instead, as suggested?

Kind regards
Thorsten Schöning