Re: Image File System Question - Mailing list pgsql-general

From Jan Peterson
Subject Re: Image File System Question
Date
Msg-id 72e966b00511041203o575d1493lb9e233eee72e6ecf@mail.gmail.com
Whole thread Raw
In response to Re: Image File System Question  (Csaba Nagy <nagy@ecircle-ag.com>)
List pgsql-general
We have a similar functional requirement to Vishal's, but with an
added twist.  Currently, we are utilizing the pg_largeobject table to
store context data delivered through our java application.  We are
finding, however, that very large pg_largeobject tables degrade our
performance in other areas (primarily VACUUM) due to I/O limitations
of our hardware (an appliance-like device).  We've been experimenting
with customized functions that provide similar access mechanisms as
are available with large object support.  Here are some examples of
what we've been experimenting with:

CREATE OR REPLACE FUNCTION fileread(varchar, varchar, int8, int8)
  RETURNS bytea AS
$BODY$
    my ($type, $hash, $offset, $length) = @_;
    my $file = $_SHARED{filename}->($type, $hash);
    my $data;
    my $fh;

    # some argument validation code removed here

    if (! open($fh, "< $file")) {
        elog(ERROR, "unable to open $file: $!");
    }
    binmode $fh;

    if ((defined($offset)) and ($offset > 0)) {
        if (! sysseek($fh, $offset, 0)) {
            elog(ERROR, "can't seek to pos $offset in $file: $!");
        }
    }

    my $numread = sysread($fh, $data, $length);
    if (! defined($numread)) {
        elog(ERROR, "unable to read $file: $!");
    }

    close($fh);
    return undef if $numread == 0;

    # here is where it gets ugly due to the way we have to munge
    # the data coming back from perl -> postgres

    $data =~ s/([\0\\\'])/sprintf("\\\\%03o",ord($1))/ge;
    return($data);
$BODY$
  LANGUAGE 'plperlu' VOLATILE;

There is also a function that sets up the %_SHARED hash to hold some
utility functions (filename is one of them, whose job it is to convert
the hash ... a.k.a. filename ... to a full path containing
sub-directories).  A similar function filewrite() is also used to
handling paging the data into the filesystem.

Problems we've encountered with this mechanism are: the evil
substitution required to quote the bytea value being returned from the
function (because we potentially have binary data including null
characters, single quotes, and backslashes), the lack of persistance
of %_SHARED (fixed in 8.1, I think, but we're using 8.0.2 + some local
patches), and the limitations of the underlying filesystem (many of
our objects are < 4k in size, but the linux ext3 filesystem we're
using has no support for storing multiple fragments or tail fragments
in a single data block, so the minimum file size on the filesystem is
4k, which kills us).

We're looking now at re-writing this stuff in C and storing the tail
fragment < 4k of the data stream directly in a bytea column in our
main table.  Still no idea if this will really solve our VACUUM
problems, but it's the best thing we've been able to come up with so
far.  Another drawback, of course, is the lack of transactional
security of this externally stored data (what if we have a statement
that does a select deletefile('type', 'hash'); and then it needs to
roll back?  answer: we're hosed).  I'd be happy to hear any
suggestions for solutions to the above problems.

        -jan-
--
Jan L. Peterson
<jan.l.peterson@gmail.com>

pgsql-general by date:

Previous
From: "Peter Filipov"
Date:
Subject: I just can't get it:record_in, record_out(I know i try to do something postgres is not developed for but there are many 'buts')
Next
From: "Jim C. Nasby"
Date:
Subject: Re: Replicating databases