Thread: Re: [GENERAL] pg_dump behaves differently for different archive formats

Re: [GENERAL] pg_dump behaves differently for different archive formats

From
Albe Laurenz
Date:
Tom Lane wrote on Dec 16, 2013:
> Albe Laurenz <laurenz.albe@wien.gv.at> writes:
>> Restoring a "plain format" dump and a "custom format" dump of
>> the same database can lead to different results:

>> pg_dump organizes the SQL statements it creates in "TOC entries".
>> If a custom format dump is restored with pg_restore, all
>> SQL statements in a TOC entry will be executed as a single command
>> and thus in a single transaction.

> Yeah, this is a bug I think.  pg_dump was designed around the idea
> that the output would be executed as a simple script, and in a
> number of places there's an expectation that one SQL statement
> can fail without affecting following ones.  So if pg_restore can't
> provide that behavior it's not good.
> 
> On the other hand, I'm not sure how much enthusiasm there'd be for
> complex or fragile changes to fix this.  A lot of people invariably
> run restores in single-transaction mode and don't really care about
> fault-tolerant restores.  Also, it's easy enough to dodge the problem
> if you must: just pipe the output into psql rather than
> direct-to-database.
> 
> So to me the question is can we fix this without doing something like
> duplicating psql's lexer?  If we have to parse out the statements
> contained in each text blob, it's probably going to be too painful.
> Some cautionary history about this sort of thing can be read at
> http://www.postgresql.org/message-id/flat/18006.1325700782@sss.pgh.pa.us

I thought that changing the dump format for this would be too
much trouble, so I came up with the attached.

It assumes that custom- or tar-format archives are written by pg_dump
and cannot contain arbitrary SQL statements, which allows me to get away
with very simple parsing.

If this is not shot down immediately on account of fragility, I'd
add it to the next commitfest page.

The problem has been a pain point for my co-workers in the past;
using single-transaction mode doesn't work for us, since we have custom objects
in our template database that cause expected errors when a dump is restored.

Yours,
Laurenz Albe

Attachment
Albe Laurenz <laurenz.albe@wien.gv.at> writes:
> I thought that changing the dump format for this would be too
> much trouble, so I came up with the attached.

> It assumes that custom- or tar-format archives are written by pg_dump
> and cannot contain arbitrary SQL statements, which allows me to get away
> with very simple parsing.

I don't think this can be trusted in the least.  To begin with, where'd
you get the idea dumps cannot contain "arbitrary SQL statements"?  CREATE
RULE at least could contain some pretty weird stuff.  This thing doesn't
look like it's even bothering to count nested parentheses, so it will
certainly fail on a multi-statement rule.  I believe you're also at risk
of SQL injection attacks from failing to account for multibyte characters
in non-ASCII-safe client encodings.

While those specific problems could no doubt be fixed, I object to the
entire concept of assuming that what pg_dump emits is always going to be
trivially parsable.  If we are to go down this path, I think we have to
replicate what psql is doing to identify statement boundaries ... and
as I mentioned upthread, that's rather a lot of code :-(
        regards, tom lane



Re: Re: [GENERAL] pg_dump behaves differently for different archive formats

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Albe Laurenz <laurenz.albe@wien.gv.at> writes:
> > I thought that changing the dump format for this would be too
> > much trouble, so I came up with the attached.

If we're going to change this, it seems to me that the only option would
be to change the dump format...  Just off-the-cuff, I'm wondering if we
could actually not change the real 'format' but simply promote each ACL
entry (and similar cases..) to top-level objects and declare that TOC
entries should be single statements.

> While those specific problems could no doubt be fixed, I object to the
> entire concept of assuming that what pg_dump emits is always going to be
> trivially parsable.  If we are to go down this path, I think we have to
> replicate what psql is doing to identify statement boundaries ... and
> as I mentioned upthread, that's rather a lot of code :-(

Agreed.  If we want this, we should handle it on the pg_dump side, not
try and work it out on the pg_restore side.
Thanks,
    Stephen

Stephen Frost <sfrost@snowman.net> writes:
> If we're going to change this, it seems to me that the only option would
> be to change the dump format...  Just off-the-cuff, I'm wondering if we
> could actually not change the real 'format' but simply promote each ACL
> entry (and similar cases..) to top-level objects and declare that TOC
> entries should be single statements.

I don't think we want even more TOC entries, but it would not be
unreasonable to insist that the statement(s) within a TOC entry be
subdivided somehow.  Essentially the payload of a TOC entry becomes
a list of strings rather than just one string.

That would mean that the problem could not be fixed for existing archive
files; but that seems OK, given the rather small number of complaints
so far.

If we had something like that, I'd be strongly inclined to get rid of
the existing convention whereby comments and ACL commands are separate
TOC entries, and make them part of the parent object's TOC entry (which'd
mean we'd want to label the sub-strings so we can tell whether they are
main object, comment, or ACL).  The fewer TOC entries we can have, the
better; there is no reason why comments/ACLs should be independently
sortable.
        regards, tom lane



On Mon, Jul 28, 2014 at 10:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Stephen Frost <sfrost@snowman.net> writes:
>> If we're going to change this, it seems to me that the only option would
>> be to change the dump format...  Just off-the-cuff, I'm wondering if we
>> could actually not change the real 'format' but simply promote each ACL
>> entry (and similar cases..) to top-level objects and declare that TOC
>> entries should be single statements.
>
> I don't think we want even more TOC entries, but it would not be
> unreasonable to insist that the statement(s) within a TOC entry be
> subdivided somehow.  Essentially the payload of a TOC entry becomes
> a list of strings rather than just one string.
>
> That would mean that the problem could not be fixed for existing archive
> files; but that seems OK, given the rather small number of complaints
> so far.
>
> If we had something like that, I'd be strongly inclined to get rid of
> the existing convention whereby comments and ACL commands are separate
> TOC entries, and make them part of the parent object's TOC entry (which'd
> mean we'd want to label the sub-strings so we can tell whether they are
> main object, comment, or ACL).  The fewer TOC entries we can have, the
> better; there is no reason why comments/ACLs should be independently
> sortable.

Maybe, but I think people will still want an option to skip restoring
them altogether (at least for ACLs).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jul 28, 2014 at 10:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If we had something like that, I'd be strongly inclined to get rid of
>> the existing convention whereby comments and ACL commands are separate
>> TOC entries, and make them part of the parent object's TOC entry (which'd
>> mean we'd want to label the sub-strings so we can tell whether they are
>> main object, comment, or ACL).  The fewer TOC entries we can have, the
>> better; there is no reason why comments/ACLs should be independently
>> sortable.

> Maybe, but I think people will still want an option to skip restoring
> them altogether (at least for ACLs).

Sure; we already have such an option, and I'm not suggesting removing
it.  The implementation would change though: it would have to look at
the individual command labels to see which part(s) of a TOC entry to
print out.
        regards, tom lane