How to hash a large amount of data within Postgres? - Mailing list pgsql-general

From Thorsten Schöning
Subject How to hash a large amount of data within Postgres?
Date
Msg-id 77309168.20210621155309@am-soft.de
Whole thread Raw
Responses Re: How to hash a large amount of data within Postgres?  ("Peter J. Holzer" <hjp-pgsql@hjp.at>)
List pgsql-general
Hi all,

I'm one of those people still storing user uploaded files within
Postgres instead of some file system and over the years this increased
to a point where individual uploads of multiple GiB are reached.

Some years ago I implemented some SQL to read all files, build a table
of SHA256 hashes and tell me how much data is redundant. The goal was
to have a look at which files share the same hash with different LOIDs
and optionally change that, so that all those files are only stored
once on the end.

While the approach was pretty naive, because it simply read all files
into memory to calculate the hashes, I'm somewhat sure it worked in
the past with Postgres 9.6. The executing server had enough free RAM
available as well to process the at most ~4 GiB large files one after
another.

I tried that SQL today with Postgres 11 on UB 18.04 and it failed:

> [Code: 0, SQL State: XX000]  FEHLER: invalid memory alloc request size 1898107949
>   Wobei: PL/pgSQL-Funktion loid_hash_calc(oid,text)[...]

> PostgreSQL
> 11.12 (Ubuntu 11.12-1.pgdg18.04+1)
> PostgreSQL JDBC Driver
> 42.2.9

I searched regaridng that issue and only found two relevant results:
Corrupted rows for some reason and simply size restrictions when
allocating memory. The latter is more likely than the former in my
case, as the restrictions seems to be 1 GiB and I do have larger
files.

I'm doing the following simply currently, because I didn't find any
interfaces allowing to forward blocks of data, LOIDs, file descriptors
or anything like that working with smaller buffers or alike.

> fd      := lo_open( loid,  INV_READ);
> size    := lo_lseek(fd, 0, SEEK_END);
> PERFORM    lo_lseek(fd, 0, SEEK_SET);

> hashBin := digest(loread(fd, size), algorithm);
> hashHex := encode(hashBin,          'hex');

So, is there any way to work around the problem I have currently? Can
I increase the memory restriction somewhere in the config? Are there
any functions available working with blocks of data I'm missing now? I
didn't find any state maintainig HASH-calls.

Thanks!

Mit freundlichen Grüßen

Thorsten Schöning

--
AM-SoFT IT-Service - Bitstore Hameln GmbH
Mitglied der Bitstore Gruppe - Ihr Full-Service-Dienstleister für IT und TK

E-Mail: Thorsten.Schoening@AM-SoFT.de
Web:    http://www.AM-SoFT.de/

Tel:   05151-  9468- 0
Tel:   05151-  9468-55
Fax:   05151-  9468-88
Mobil:  0178-8 9468-04

AM-SoFT IT-Service - Bitstore Hameln GmbH, Brandenburger Str. 7c, 31789 Hameln
AG Hannover HRB 221853 - Geschäftsführer: Janine Galonska


Für Rückfragen stehe ich Ihnen sehr gerne zur Verfügung.

Mit freundlichen Grüßen

Thorsten Schöning


Tel: 05151 9468 0
Fax: 05151 9468 88
Mobil:
Webseite: https://www.am-soft.de

AM-Soft IT-Service - Bitstore Hameln GmbH i.G. ist ein Mitglied der Bitstore Gruppe - Ihr Full-Service-Dienstleister
fürIT und TK 

AM-Soft IT-Service - Bitstore Hameln GmbH i.G.
Brandenburger Str. 7c
31789 Hameln
Tel: 05151 9468 0

Bitstore IT-Consulting GmbH
Zentrale - Berlin Lichtenberg
Frankfurter Allee 285
10317 Berlin
Tel: 030 453 087 80

CBS IT-Service - Bitstore Kaulsdorf UG
Tel: 030 453 087 880 1

Büro Dallgow-Döberitz
Tel: 03322 507 020

Büro Kloster Lehnin
Tel: 033207 566 530

PCE IT-Service - Bitstore Darmstadt UG
Darmstadt
Tel: 06151 392 973 0

Büro Neuruppin
Tel: 033932 606 090

ACI EDV Systemhaus - Bitstore Dresden GmbH
Dresden
Tel: 0351 254 410

Das Systemhaus - Bitstore Magdeburg GmbH
Magdeburg
Tel: 0391 636 651 0

Allerdata.IT - Bitstore Wittenberg GmbH
Wittenberg
Tel: 03491 876 735 7

Büro Liebenwalde
Tel: 033054 810 00

HSA - das Büro - Bitstore Altenburg UG
Altenburg
Tel: 0344 784 390 97

Bitstore IT – Consulting GmbH
NL Piesteritz
Piesteritz
Tel: 03491 644 868 6

Solltec IT-Services - Bitstore Braunschweig UG
Braunschweig
Tel: 0531 206 068 0

MF Computer Service - Bitstore Gütersloh GmbH
Gütersloh
Tel: 05245 920 809 3

Firmensitz: AM-Soft IT-Service - Bitstore Hameln GmbH i.G. , Brandenburger Str. 7c , 31789 Hameln
Geschäftsführer Janine Galonska









pgsql-general by date:

Previous
From: Oliver Kohll
Date:
Subject: replace inside regexp_replace
Next
From: Francisco Olarte
Date:
Subject: Re: replace inside regexp_replace