Thread: md5(large_object_id)
I have searched the mailing list archives as well as the internet at large but haven't been able to find a _streaming_ version of md5 hashing for large objects.

The closest thing I found is this thread

  http://www.postgresql.org/message-id/5d0f60990704081003p615530acx86257a70bff33fa4@mail.gmail.com

which eventually led to this code

  http://llg.cubic.org/pg-mdagg/

which, however, doesn't stream in the large object data either.

Essentially, for hashing _very_ large objects (on the order of several GB per object; I am dealing with radiology studies aka DICOM data) one would want an md5 function which streams in parts of a large object piece by piece using md5_update and md5_finalize or some such.

It didn't look like pgcrypto offers a streaming version either.

Anything I am overlooking?

Thanks,
Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
On Mon, Sep 21, 2015 at 10:17 PM, Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:
I'm not sure if there is much to be gained from storing such large objects in a database. Also, the max size of a value is 1GB.
Maybe it would be better to do some file administration in the database, but not the actual storage. Then you could use a tool that does what you want on the bare file and maybe store the results in the database.
HTH,
--
> I am dealing with radiology
> studies aka DICOM data) one would want an md5 function which
> streams in parts of a large object piece by piece using
> md5_update and md5_finalize or some such.
> It didn't look like pgcrypto offers a streaming version either.
> Anything I am overlooking?
Hi,
I don't think that it is possible to stream the result of a query anyway, or is this some neat new feature?
--
Willy-Bas Loos
> I don't think that it is possible to stream the result of a query anyway,

I was unclear. I don't expect query results to stream out to the client. I want the "SELECT md5(OID);" to return a single md5 hash value.

It is already possible to "SELECT md5(lo_read(OID));" but that will read the entire large object into memory. I would like the function md5(oid) to _internally_ feed consecutive calls to lo_read() into an updating md5 function and eventually return the final hash.

> I'm not sure if there is much to be gained from storing such large
> objects in a database.

ACID. Permissions. Single access method.

> Also, the max size of a value is 1GB.

The maximum size is 4 TB at the moment for large objects.

Karsten Hilbert
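(Not part of the thread: as an illustration of the piece-by-piece idea, here is a minimal client-side sketch using psycopg2's large-object interface and Python's hashlib. The connection string and OID are placeholders. It does not solve the server-side problem being asked about, it only shows the update/finalize pattern on the client.)

  import hashlib
  import psycopg2

  def lo_md5_streamed(conn, lo_oid, chunk_size=2048 * 256):
      # hash a large object chunk by chunk; the whole object never sits in memory
      md5 = hashlib.md5()
      lob = conn.lobject(lo_oid, 'rb')   # large objects must be read inside a transaction
      try:
          while True:
              data = lob.read(chunk_size)
              if not data:
                  break
              md5.update(data)
      finally:
          lob.close()
      return md5.hexdigest()

  conn = psycopg2.connect('dbname=test')   # placeholder DSN
  print(lo_md5_streamed(conn, 12345))      # placeholder OID
  conn.close()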
Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:

> I am dealing with radiology studies aka DICOM data) one would
> want an md5 function which streams in parts of a large object
> piece by piece using md5_update and md5_finalize or some such.

It would certainly be possible to write a lo_md5(oid) function to do this, but as far as I'm aware nobody has yet done so. How are your C skills?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Oct 05, 2015 at 03:27:26PM +0000, Kevin Grittner wrote:

> Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:
>
> > I am dealing with radiology studies aka DICOM data) one would
> > want an md5 function which streams in parts of a large object
> > piece by piece using md5_update and md5_finalize or some such.
>
> It would certainly be possible to write a lo_md5(oid) function to do
> this, but as far as I'm aware nobody has yet done so. How are your
> C skills?

Next to none, unfortunately :-/

Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
Karsten Hilbert wrote:

> On Mon, Oct 05, 2015 at 03:27:26PM +0000, Kevin Grittner wrote:
>
> > Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:
> >
> > > I am dealing with radiology studies aka DICOM data) one would
> > > want an md5 function which streams in parts of a large object
> > > piece by piece using md5_update and md5_finalize or some such.
> >
> > It would certainly be possible to write a lo_md5(oid) function to do
> > this, but as far as I'm aware nobody has yet done so. How are your
> > C skills?
>
> Next to none, unfortunately :-/

It would be nice to have this API exposed in pgcrypto, but as a workaround, you could use plperl instead, or a similarly "easy" PL that provides it.

Here's an example with plperlu, using the core Digest module. It works with other hash types, not just MD5. See http://perldoc.perl.org/Digest.html for the exact names (warning: 'SHA-1' instead of pgcrypto's 'SHA1').

  CREATE OR REPLACE FUNCTION lo_digest(text, oid, int)
  RETURNS bytea AS $$
    use Digest;
    use strict;
    use Data::Dumper;

    my $ctxt = Digest->new($_[0]);
    my $sz = $_[2];
    elog(ERROR, "Invalid chunk size: $sz") if ($sz <= 0);

    my $sth = spi_query("SELECT lo_open($_[1], 262144) as fd");
    my $row = spi_fetchrow($sth);
    spi_cursor_close($sth);
    if ($row) {
      my $fd = $row->{fd};
      my $bytes;
      my $plan = spi_prepare("SELECT loread($fd, $sz) as chunk");
      do {
        $sth = spi_query_prepared($plan);
        $row = spi_fetchrow($sth);
        $bytes = decode_bytea($row->{chunk});
        $ctxt->add($bytes);
        spi_cursor_close($sth);
      } while (length($bytes) > 0);
      spi_exec_query("select lo_close($fd)");
      spi_freeplan($plan);
    }
    return encode_bytea($ctxt->digest);
  $$ LANGUAGE plperlu;

Example of use:

  # select lo_digest('MD5', 2557608, 2048*256);
               lo_digest
  ------------------------------------
   \xa8447e145d0f8d9ca7fe7df1bbf06d75

2557608 is the oid and 2048*256 represents 512KB. The lo blocksize is typically 2048, so a multiple of 2048 is ideal to optimize reads.

I expect it to work on very large contents while requesting only modest amounts of memory. But unfortunately it seems pretty slow. On my desktop core i5-3470, it takes 2.5 seconds to digest a 100 MB audio file already in cache. An independent Perl script doing the equivalent processing on the same data takes 0.4s. It's not because of Digest::MD5 or loread(); it seems that it's the pipe in-between, with the text->bytea decoding, that eats most of the CPU cycles, that step being necessary because plperl lacks the ability to consume bytea directly. Sigh...

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
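(Also not from the thread: for completeness, the same chunked loop can be sketched in PL/Python, where hashlib is available inside the server. This is untested; it assumes the plpythonu language is installed, and the function name lo_digest_py is made up. Whether it avoids the decoding overhead seen with plperl has not been measured.)

  CREATE OR REPLACE FUNCTION lo_digest_py(algo text, loid oid, chunk int)
  RETURNS text AS $$
      # stream a large object through hashlib, chunk by chunk
      import hashlib
      if chunk <= 0:
          plpy.error('Invalid chunk size: %s' % chunk)
      h = hashlib.new(algo)    # e.g. 'md5', 'sha1'
      # 262144 = 0x40000 = INV_READ
      fd = plpy.execute("SELECT lo_open(%s, 262144) AS fd" % loid)[0]['fd']
      plan = plpy.prepare("SELECT loread($1, $2) AS chunk", ['integer', 'integer'])
      while True:
          data = plpy.execute(plan, [fd, chunk])[0]['chunk']
          if not data:
              break
          h.update(data)
      plpy.execute("SELECT lo_close(%s)" % fd)
      return h.hexdigest()
  $$ LANGUAGE plpythonu;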
On Mon, Oct 05, 2015 at 03:27:26PM +0000, Kevin Grittner wrote:

> Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:
>
> > I am dealing with radiology studies aka DICOM data) one would
> > want an md5 function which streams in parts of a large object
> > piece by piece using md5_update and md5_finalize or some such.
>
> It would certainly be possible to write a lo_md5(oid) function to do
> this, but as far as I'm aware nobody has yet done so. How are your
> C skills?

I had hoped someone was going to say: "Yeah, right, low hanging fruit, let's just do it for 9.next" :-)

Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
On Wed, Oct 07, 2015 at 12:55:38PM +0200, Karsten Hilbert wrote:

> > > I am dealing with radiology studies aka DICOM data) one would
> > > want an md5 function which streams in parts of a large object
> > > piece by piece using md5_update and md5_finalize or some such.
>
> > It would certainly be possible to write a lo_md5(oid) function to do
> > this, but as far as I'm aware nobody has yet done so. How are your
> > C skills?
>
> I had hoped someone was going to say: "Yeah, right, low
> hanging fruit, let's just do it for 9.next" :-)

Someone _with_ C skills, that is.

Thanks for the plperlu suggestion, btw.

Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
2015-10-07 13:18 GMT+02:00 Karsten Hilbert <Karsten.Hilbert@gmx.net>:
> On Wed, Oct 07, 2015 at 12:55:38PM +0200, Karsten Hilbert wrote:
>
> > > > I am dealing with radiology studies aka DICOM data) one would
> > > > want an md5 function which streams in parts of a large object
> > > > piece by piece using md5_update and md5_finalize or some such.
>
> > > It would certainly be possible to write a lo_md5(oid) function to do
> > > this, but as far as I'm aware nobody has yet done so. How are your
> > > C skills?
> >
> > I had hoped someone was going to say: "Yeah, right, low
> > hanging fruit, let's just do it for 9.next" :-)
>
> Someone _with_ C skills, that is.
if the size of blobs is less than 1GB, then it should be possible in plpgsql too.
postgres=# \lo_import ~/Desktop/001.jpg
lo_import 24577
postgres=# select md5(lo_get(24577));
┌──────────────────────────────────┐
│ md5 │
╞══════════════════════════════════╡
│ 610ccaab8c7c60e1168abfa799d1305d │
└──────────────────────────────────┘
(1 row)
Regards
Pavel
On Wed, Oct 07, 2015 at 01:30:24PM +0200, Pavel Stehule wrote:

> > > > > I am dealing with radiology studies aka DICOM data) one would
> > > > > want an md5 function which streams in parts of a large object
> > > > > piece by piece using md5_update and md5_finalize or some such.
>
> > > > It would certainly be possible to write a lo_md5(oid) function to do
> > > > this, but as far as I'm aware nobody has yet done so. How are your
> > > > C skills?
> > >
> > > I had hoped someone was going to say: "Yeah, right, low
> > > hanging fruit, let's just do it for 9.next" :-)
> >
> > Someone _with_ C skills, that is.
>
> if the size of blobs is less than 1GB, then it should be possible in
> plpgsql too.
>
> postgres=# \lo_import ~/Desktop/001.jpg
> lo_import 24577
> postgres=# select md5(lo_get(24577));
> ┌──────────────────────────────────┐
> │ md5 │
> ╞══════════════════════════════════╡
> │ 610ccaab8c7c60e1168abfa799d1305d │
> └──────────────────────────────────┘
> (1 row)

It is, I know. But I am exactly after the use case > 1 GB.

Thanks,
Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
For the record - I have also devised another solution to the underlying problem (md5(bytea) cannot process large amounts of input), chunked md5():

  create or replace function lo_chunked_md5(oid, int)
      returns text
      language 'plpgsql'
      stable strict
      as '
  DECLARE
      lo_id alias for $1;
      chunk_size alias for $2;
      _lo_fd integer;
      _lo_size integer;
      _chunk_count integer;
      _remainder integer;
      _md5_concat text;
      INV_READ constant integer := x''40000''::integer;
      SEEK_SET constant integer := 0;
      SEEK_END constant integer := 2;
  BEGIN
      -- check for existence of lo_id
      _lo_fd := lo_open(lo_id, INV_READ);
      -- get size
      _lo_size := lo_lseek(_lo_fd, 0, SEEK_END);
      PERFORM lo_close(_lo_fd);
      -- calculate chunks and remainder
      _chunk_count := _lo_size / chunk_size;
      _remainder := _lo_size % chunk_size;
      -- loop over chunks
      _md5_concat := '''';
      FOR _chunk_id in 1.._chunk_count LOOP
          _md5_concat := _md5_concat || md5(lo_get(lo_id, (_chunk_id - 1) * chunk_size, chunk_size));
      END LOOP;
      -- add remainder
      _md5_concat := _md5_concat || md5(lo_get(lo_id, _chunk_count * chunk_size, _remainder));
      return md5(_md5_concat);
  END;';

This can easily be mirrored by a client-side function, say,

  import hashlib
  import logging

  # assumed module-level context, not shown in the original snippet
  _log = logging.getLogger(__name__)
  _MB = 1024 * 1024

  def file2chunked_md5(filename=None, chunk_size=500*_MB):
      _log.debug('chunked_md5(%s, %s)', filename, chunk_size)
      md5_concat = u''
      f = open(filename, 'rb')
      while True:
          md5 = hashlib.md5()
          data = f.read(chunk_size)
          if not data:
              break
          md5.update(data)
          md5_concat += md5.hexdigest()
      f.close()
      md5 = hashlib.md5()
      md5.update(md5_concat)
      hex_digest = md5.hexdigest()
      _log.debug('md5(%s): %s', md5_concat, hex_digest)
      return hex_digest

in Python.

Still, I'd welcome a native, streaming md5(loid) which is bound to be more optimized by design.

Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346
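(Not from the thread: a quick way to sanity-check that the server-side and client-side functions above agree, sketched with psycopg2. The DSN, OID and file path are placeholders, and file2chunked_md5() is assumed to be importable from the snippet above.)

  import psycopg2

  CHUNK = 500 * 1024 * 1024   # must be the same value on both sides

  def chunked_md5_matches(dsn, lo_oid, filename):
      # server-side digest via the lo_chunked_md5() function defined above
      conn = psycopg2.connect(dsn)
      cur = conn.cursor()
      cur.execute("SELECT lo_chunked_md5(%s, %s)", (lo_oid, CHUNK))
      server_md5 = cur.fetchone()[0]
      conn.close()
      # client-side digest of a local copy of the same data
      client_md5 = file2chunked_md5(filename, CHUNK)
      return server_md5 == client_md5

Note that when the object size is an exact multiple of the chunk size, the plpgsql version appends the md5 of an empty chunk while the Python version does not, so the two digests can differ in that corner case.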
On 10/7/15 10:34 AM, Karsten Hilbert wrote:

> Still, I'd welcome a native, streaming md5(loid) which is
> bound to be more optimized by design.

It would be nice if we had an interface to TOAST that allowed for streaming (well, really chunking) data to a function. That wouldn't help in this particular case, but it would significantly expand the usefulness of a streaming version of md5 and all the other hash operators.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Wed, Oct 07, 2015 at 01:01:41PM -0500, Jim Nasby wrote:

> > Still, I'd welcome a native, streaming md5(loid) which is
> > bound to be more optimized by design.
>
> It would be nice if we had an interface to TOAST that allowed for streaming
> (well, really chunking) data to a function. That wouldn't help in this
> particular case, but it would significantly expand the usefulness of a
> streaming version of md5 and all the other hash operators.

Sure enough.

A native md5(oid_of_large_object), which internally uses lo_read(), didn't seem all that much work to me. Maybe I'm wrong on that side of things.

Karsten
--
GPG key ID E4071346 @ eu.pool.sks-keyservers.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346