Karsten Hilbert wrote:
> On Mon, Oct 05, 2015 at 03:27:26PM +0000, Kevin Grittner wrote:
>
> > Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:
> >
> > > I am dealing with radiology studies aka DICOM data) one would
> > > want an md5 function which streams in parts of a large object
> > > piece by piece using md5_update and m5_finalize or some such.
> > It would certainly be possible to write a lo_md5(oid) function to do
> > this, but as far as I'm aware nobody has yet done so. How are your
> > C skills?
>
> Next to none, unfortunately :-/
It would be nice to have this API exposed in pgcrypto,
but as a workaround, you could use plperl instead, or a
similarly "easy" PL that provides it.
Here's an example with plperlu, using the core
Digest module. It works with other hash types, not just MD5.
See http://perldoc.perl.org/Digest.html for the exact names
(warning: 'SHA-1' instead of pgcrypto's 'SHA1').
CREATE OR REPLACE FUNCTION lo_digest(text, oid, int)
RETURNS bytea AS
$$
use Digest;
use strict;
use Data::Dumper;
my $ctxt = Digest->new($_[0]);
my $sz=$_[2];
elog(ERROR, "Invalid chunk size: $sz") if ($sz<=0);
my $sth = spi_query("SELECT lo_open($_[1], 262144) as fd");
my $row = spi_fetchrow($sth);
spi_cursor_close($sth);
if ($row) {
my $fd = $row->{fd};
my $bytes;
my $plan = spi_prepare("SELECT loread($fd, $sz) as chunk");
do {
$sth = spi_query_prepared($plan);
$row = spi_fetchrow($sth);
$bytes = decode_bytea($row->{chunk});
$ctxt->add($bytes);
spi_cursor_close($sth);
} while (length($bytes)>0);
spi_exec_query("select lo_close($fd)");
spi_freeplan($plan);
}
return encode_bytea($ctxt->digest);
$$ LANGUAGE plperlu;
Example of use:
# select lo_digest('MD5', 2557608, 2048*256);
lo_digest
------------------------------------
\xa8447e145d0f8d9ca7fe7df1bbf06d75
2557608 is the oid and 2048*256 represents 512KB.
The lo blocksize is typically 2048, so a multiple of 2048 is ideal
to optimize reads.
I expect it to work on very large contents while requesting only modest
amounts of memory. But unfortunately is seems pretty slow.
On my desktop core i5-3470, it takes 2.5 seconds to digest
a 100 MB audio file already in cache. An independant Perl script
doing the equiavlent processing on the same data takes 0.4s .
It's not because of Digest::MD5 or loread(), it seems that it's the
pipe in-between with the text->bytea decoding that eats most of the
CPU cycles, that step being necessary because plperl lacks the ability
to consume bytea directly. Sigh...
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite