Thread: some pg_dump query code simplification

some pg_dump query code simplification

From
Peter Eisentraut
Date:
There is a lot of version dependent code in pg_dump now, especially
per-version queries.  The way they were originally built was that we
just have an entirely separate query per version.  But as the number of
versions and the size of the query grows, this becomes unwieldy.

So I tried, as an example, to write the queries in getTableAttrs()
differently.  Instead of repeating the almost same large query in each
version branch, use one query and add a few columns to the SELECT list
depending on the version.  This saves a lot of duplication and is easier
to deal with.  I think this patch is useful in its own right, but there
is also the larger question of whether people think this is a better
style going forward, for pg_dump and psql.  (It's already in use in psql
in some places.)  It won't always work, if the query structure is very
different between versions, but as this example shows, it can be useful.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: some pg_dump query code simplification

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> There is a lot of version dependent code in pg_dump now, especially
> per-version queries.  The way they were originally built was that we
> just have an entirely separate query per version.  But as the number of
> versions and the size of the query grows, this becomes unwieldy.

Agreed, it's becoming a bit of a mess.

> So I tried, as an example, to write the queries in getTableAttrs()
> differently.  Instead of repeating the almost same large query in each
> version branch, use one query and add a few columns to the SELECT list
> depending on the version.  This saves a lot of duplication and is easier
> to deal with.

I think I had this discussion already with somebody, but ... I do not
like this style at all:

            tbinfo->attidentity[j] = (i_attidentity >= 0 ? *(PQgetvalue(res, j, i_attidentity)) : '\0');

It's basically throwing away all the cross-checking that exists now to
ensure that you didn't forget to deal with a column, misspell the column's
AS name in one place or the other, etc etc.  The code could be completely
wrong and it'd still fail silently, producing a default result that might
well be the right answer, making it hard to debug.  I think the way to do
this is to continue to require the query to produce the same set of
columns regardless of server version.  So the version-dependent bits would
look like, eg,

        if (fout->remoteVersion >= 110000)
            appendPQExpBufferStr(q,
                                 "CASE WHEN a.atthasmissing AND NOT a.attisdropped "
                                 "THEN a.attmissingval ELSE null END AS attmissingval,\n");
        else
            appendPQExpBufferStr(q,
                                 "null AS attmissingval,\n");

and the data extraction code would remain the same as now.

Other than that nit, I agree that this shows a lot of promise.

            regards, tom lane


Re: some pg_dump query code simplification

From
Stephen Frost
Date:
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> > There is a lot of version dependent code in pg_dump now, especially
> > per-version queries.  The way they were originally built was that we
> > just have an entirely separate query per version.  But as the number of
> > versions and the size of the query grows, this becomes unwieldy.
>
> Agreed, it's becoming a bit of a mess.
>
> > So I tried, as an example, to write the queries in getTableAttrs()
> > differently.  Instead of repeating the almost same large query in each
> > version branch, use one query and add a few columns to the SELECT list
> > depending on the version.  This saves a lot of duplication and is easier
> > to deal with.
>
> I think I had this discussion already with somebody, but ... I do not
> like this style at all:
>
>             tbinfo->attidentity[j] = (i_attidentity >= 0 ? *(PQgetvalue(res, j, i_attidentity)) : '\0');
>
> It's basically throwing away all the cross-checking that exists now to
> ensure that you didn't forget to deal with a column, misspell the column's
> AS name in one place or the other, etc etc.  The code could be completely
> wrong and it'd still fail silently, producing a default result that might
> well be the right answer, making it hard to debug.  I think the way to do
> this is to continue to require the query to produce the same set of
> columns regardless of server version.  So the version-dependent bits would
> look like, eg,
>
>         if (fout->remoteVersion >= 110000)
>             appendPQExpBufferStr(q,
>                                  "CASE WHEN a.atthasmissing AND NOT a.attisdropped "
>                                  "THEN a.attmissingval ELSE null END AS attmissingval,\n");
>         else
>             appendPQExpBufferStr(q,
>                                  "null AS attmissingval,\n");
>
> and the data extraction code would remain the same as now.
>
> Other than that nit, I agree that this shows a lot of promise.

+1 to Tom's thoughts on this- seems like a much better approach.

Overall, I definitely agree that it'd be nice to reduce the amount of
duplication happening today.  Working out some way to actually test all
of those queries would be awful nice too.

I wonder- what if we had an option to pg_dump to explicitly tell it what
the server's version is and then have TAP tests to run with different
versions?  There may be some cases where that doesn't work, of course,
and we'd need to address that, but having those tests might reduce the
number of times we end up with something committed which doesn't work at
all against some older version of PG because it wasn't tested...

Thanks!

Stephen

Attachment

Re: some pg_dump query code simplification

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> I wonder- what if we had an option to pg_dump to explicitly tell it what
> the server's version is and then have TAP tests to run with different
> versions?

Uh ... telling it what the version is doesn't make that true, so I'd
have no confidence in a test^H^H^H^Hkluge done that way.  The way
to test is to point it at an *actual* back-branch server.

Andrew has a buildfarm module that does precisely that, although
I'm not sure what its test dataset is --- probably the regression
database from each branch.  I also have a habit of doing such testing
manually whenever I touch version-sensitive parts of pg_dump.

Dunno about the idea of running the pg_dump TAP tests against back
branches.  I find that code sufficiently unreadable that maintaining
several more copies of it doesn't sound like fun at all.

            regards, tom lane


Re: some pg_dump query code simplification

From
Stephen Frost
Date:
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > I wonder- what if we had an option to pg_dump to explicitly tell it what
> > the server's version is and then have TAP tests to run with different
> > versions?
>
> Uh ... telling it what the version is doesn't make that true, so I'd
> have no confidence in a test^H^H^H^Hkluge done that way.  The way
> to test is to point it at an *actual* back-branch server.

I certainly agree that this would be ideal, but nonetheless, I've seen
multiple cases where just trying to run the query, even against a
current version, would have shown that it's malformed or has some issue
which needs fixing and today we haven't even got that.

> Andrew has a buildfarm module that does precisely that, although
> I'm not sure what its test dataset is --- probably the regression
> database from each branch.  I also have a habit of doing such testing
> manually whenever I touch version-sensitive parts of pg_dump.

I've gotten better about doing that back-branch testing myself and
certainly prefer it, but I think we should also have buildfarm coverage.
I don't think we have the full matrix covered, or, really, anything
anywhere near it, so I'm looking for other options to at least get that
code exercised.

> Dunno about the idea of running the pg_dump TAP tests against back
> branches.  I find that code sufficiently unreadable that maintaining
> several more copies of it doesn't sound like fun at all.

I hadn't been thinking we'd need more copies of it, simply a few more
runs which have different version values specified, though even just
doing a pg_dump of the regression suite for each combination would be
something.

Thanks!

Stephen

Attachment

Re: some pg_dump query code simplification

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> Stephen Frost <sfrost@snowman.net> writes:
>>> I wonder- what if we had an option to pg_dump to explicitly tell it what
>>> the server's version is and then have TAP tests to run with different
>>> versions?

>> Uh ... telling it what the version is doesn't make that true, so I'd
>> have no confidence in a test^H^H^H^Hkluge done that way.  The way
>> to test is to point it at an *actual* back-branch server.

> I certainly agree that this would be ideal, but nonetheless, I've seen
> multiple cases where just trying to run the query, even against a
> current version, would have shown that it's malformed or has some issue
> which needs fixing and today we haven't even got that.

Yeah, but cases where we need to touch column C in one version and column
D in another can't be made to pass when pg_dump is operating under a false
assumption about the server version.  So this seems like a nonstarter.

            regards, tom lane


Re: some pg_dump query code simplification

From
Stephen Frost
Date:
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> >> Stephen Frost <sfrost@snowman.net> writes:
> >>> I wonder- what if we had an option to pg_dump to explicitly tell it what
> >>> the server's version is and then have TAP tests to run with different
> >>> versions?
>
> >> Uh ... telling it what the version is doesn't make that true, so I'd
> >> have no confidence in a test^H^H^H^Hkluge done that way.  The way
> >> to test is to point it at an *actual* back-branch server.
>
> > I certainly agree that this would be ideal, but nonetheless, I've seen
> > multiple cases where just trying to run the query, even against a
> > current version, would have shown that it's malformed or has some issue
> > which needs fixing and today we haven't even got that.
>
> Yeah, but cases where we need to touch column C in one version and column
> D in another can't be made to pass when pg_dump is operating under a false
> assumption about the server version.  So this seems like a nonstarter.

Yes, there are cases that wouldn't work.

I'd be much happier with a complete solution, were that forthcoming.

Thanks!

Stephen

Attachment

Re: some pg_dump query code simplification

From
Andrew Dunstan
Date:

On 08/28/2018 06:05 PM, Tom Lane wrote:
> Stephen Frost <sfrost@snowman.net> writes:
>> I wonder- what if we had an option to pg_dump to explicitly tell it what
>> the server's version is and then have TAP tests to run with different
>> versions?
> Uh ... telling it what the version is doesn't make that true, so I'd
> have no confidence in a test^H^H^H^Hkluge done that way.  The way
> to test is to point it at an *actual* back-branch server.
>
> Andrew has a buildfarm module that does precisely that, although
> I'm not sure what its test dataset is --- probably the regression
> database from each branch.  I also have a habit of doing such testing
> manually whenever I touch version-sensitive parts of pg_dump.


It's all the databases from a buildfarm run apart from TAP tests. Since 
it uses USE_MODULE_DB=1 it covers most of the contrib modules plus the 
standard regression db, as well as isolation test and pl_regression 
(which should be taught to do separate DBs for each PL if requested).


There is no reason it couldn't test more.


> Dunno about the idea of running the pg_dump TAP tests against back
> branches.  I find that code sufficiently unreadable that maintaining
> several more copies of it doesn't sound like fun at all.
>
>             


Agreed. The code could do with a lot of comments. I recently looked at 
adding something to it and decided I had more pressing things to do.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: some pg_dump query code simplification

From
Andrew Dunstan
Date:

On 08/28/2018 06:10 PM, Stephen Frost wrote:
>
>> Andrew has a buildfarm module that does precisely that, although
>> I'm not sure what its test dataset is --- probably the regression
>> database from each branch.  I also have a habit of doing such testing
>> manually whenever I touch version-sensitive parts of pg_dump.
> I've gotten better about doing that back-branch testing myself and
> certainly prefer it, but I think we should also have buildfarm coverage.
> I don't think we have the full matrix covered, or, really, anything
> anywhere near it, so I'm looking for other options to at least get that
> code exercised.
>
>


If by matrix you mean the version matrix, then the module in question 
covers every live branch as the source and every same or later live 
branch as the destination.

It could conceivably cover older branches as well - the branch names 
aren't actually hardcoded - I'd just have to be able to build them and 
do a standard buildfarm run once.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: some pg_dump query code simplification

From
Peter Eisentraut
Date:
On 28/08/2018 23:43, Tom Lane wrote:
> I think I had this discussion already with somebody, but ... I do not
> like this style at all:
> 
>             tbinfo->attidentity[j] = (i_attidentity >= 0 ? *(PQgetvalue(res, j, i_attidentity)) : '\0');

OK, updated patch attached.  If the updated style is acceptable, I'll
start running more extensive tests against the older branches.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: some pg_dump query code simplification

From
Tom Lane
Date:
Peter Eisentraut <peter.eisentraut@2ndquadrant.com> writes:
> OK, updated patch attached.  If the updated style is acceptable, I'll
> start running more extensive tests against the older branches.

Looks sane to me.

            regards, tom lane


Re: some pg_dump query code simplification

From
Stephen Frost
Date:
Greetings,

* Andrew Dunstan (andrew.dunstan@2ndquadrant.com) wrote:
> On 08/28/2018 06:10 PM, Stephen Frost wrote:
> >>Andrew has a buildfarm module that does precisely that, although
> >>I'm not sure what its test dataset is --- probably the regression
> >>database from each branch.  I also have a habit of doing such testing
> >>manually whenever I touch version-sensitive parts of pg_dump.
> >I've gotten better about doing that back-branch testing myself and
> >certainly prefer it, but I think we should also have buildfarm coverage.
> >I don't think we have the full matrix covered, or, really, anything
> >anywhere near it, so I'm looking for other options to at least get that
> >code exercised.
>
> If by matrix you mean the version matrix, then the module in question covers
> every live branch as the source and every same or later live branch as the
> destination.

That's certainly better than I had thought it did.

> It could conceivably cover older branches as well - the branch names aren't
> actually hardcoded - I'd just have to be able to build them and do a
> standard buildfarm run once.

Having it cover older branches really is pretty important given how much
code there is for those older branches in pg_dump.  I've had pretty good
success compiling back down to about 7.2, as I recall, using patches
that Greg Stark made available.  Older is probably possible too.
Getting everything we claim to support covered would be fantastic (which
I believe is basically 9.6/9.5/9.4/9.3 against 7.0 and up, and
master/11/10 against 8.0 and up).

Thanks!

Stephen

Attachment

Re: some pg_dump query code simplification

From
Stephen Frost
Date:
Greetings,

* Andrew Dunstan (andrew.dunstan@2ndquadrant.com) wrote:
> On 08/28/2018 06:05 PM, Tom Lane wrote:
> >Dunno about the idea of running the pg_dump TAP tests against back
> >branches.  I find that code sufficiently unreadable that maintaining
> >several more copies of it doesn't sound like fun at all.
>
> Agreed. The code could do with a lot of comments. I recently looked at
> adding something to it and decided I had more pressing things to do.

I'm happy to add more comments to it..  There's a pretty good block that
tries to describe how the tests work above where the tests are actually
defined.  What would help?  Should I include an example in that code
block?  Or are you looking for more comments in the actual test
definitions about what they're covering or why they're included?  Or are
you interested in comments about the actual code down at the bottom
which runs the tests..?

Thanks!

Stephen

Attachment