Re: Test to dump and restore objects left behind by regression - Mailing list pgsql-hackers

From Ashutosh Bapat
Subject Re: Test to dump and restore objects left behind by regression
Date
Msg-id CAExHW5uvx2LEyrUBdctV5gS25Zeb+-eXESkK93siQxWSjYFy6A@mail.gmail.com
Whole thread Raw
In response to Re: Test to dump and restore objects left behind by regression  (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
List pgsql-hackers
Hi Tom and Michael,

Thanks for your inputs.

I am replying to all the comments in a single email arranging related
comments together.

On Thu, Oct 31, 2024 at 11:26 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> On my laptop, testing the plain format adds roughly 12s, in a test
> that now takes 1m20s to run vs 1m32s.  Enabling regress_dump_formats
> and adding three more formats counts for 45s of runtime.  For a test
> that usually shows up as the last one to finish for a heavily
> parallelized run.  So even the default of "plain" is going to be
> noticeable, I am afraid.

> On Thu, Oct 31, 2024 at 10:26:01AM -0400, Tom Lane wrote:
> > I'd be okay with adding it in a form where the default behavior
> > is to do no additional checking.

If I run the test alone, it takes 45s (master) vs 54s (with patch) on
my laptop. These readings are similar to what you have observed. The
restore step by itself takes most of the time, even if a. we eliminate
data, b. use formats other than plain or c. use --jobs=2. Hence I am
fine with Tom's suggestion i.e. default behaviour is to do no
additional testing. I propose to test all dump formats (including
plain) only when PG_TEST_EXTRA has "regress_dump_tests". But see next

>
> What's the advantage of testing all the formats?  Would that stuff
> have been able to catch up more issues related to specific format(s)
> when it came to the compression improvements with inheritance?

I haven't caught any more issues with formats other than "plain". It
is more for future-proof testing. I am fine if we want to test just
plain dump format for now. Adding more formats would be easier if
required.

>> Whether that's worth maintaining
> > is hard to say though.
>
> In terms of maintenance, it would be nice if we are able to minimize
> the code added to the pg_upgrade suite, so as it would be simple to
> switch this code elsewhere if need be.

I think Tom hints at maintenance of
AdjustDump::adjust_dump_for_restore(). In future, if the difference
between dump from the original database and that from the restored
database grows, we will need to update
AdjustDump::adjust_dump_for_restore() accordingly. That will be some
maintenance. But the person introducing such changes will get a chance
to fix them if unintentional. That balances out any maintenance
efforts, I think.

>
> +   test_regression_dump_restore($oldnode, %node_params);
>
> Why is this only done for the main regression test suite?  Perhaps it
> could be useful as well for tests that want to check after their own
> custom dumps, as a shortcut?
>
> Linked to that.  Could there be some use in being able to pass down a
> list of databases to this routine, rather than being limited only to
> "regression"?  Think extension databases with USE_MODULE_DB that have
> unique names.

I did think of it when implementing this function. In order to test
the custom dumps or extensions, adjust_regress_dumpfile() will need to
be extensible or the test will need a way to accept a custom dump file
for comparison. Without a concrete use case, adding the customization
hooks might go wrong and will need rework.
test_regression_dump_restore() itself is isolated enough that we can
extend it when the need arises. When the need arises we will know what
needs to be extensible and how. If you have a specific use case,
please let me know, I will accommodate it in my patch.

> Perhaps we
> should be more ambitious and move more logic into AdjustDump.pm?  If
> we think that the full cycle of dump -> restore -> dump -> compare
> could be used elsewhere, this would limit the footprint of what we are
> doing in the pg_upgrade script in this patch and be able to do similar
> stuff in out-of-core extensions or other tests.  Let's say a
> PostgreSQL::Test::Dump.pm?

dump->restore->dump->compare pattern is seen only in 002_pg_upgrade
test. 002_compare_backups compares dumps from servers but does not use
the dump->restore->dump->compare pattern. If a similar pattern starts
appearing at multiple places, we will easily move
test_regression_dump_restore() to a common module to avoid code
duplication. That function is isolated enough for that purpose.

> - Dump of a database into an output file given in input, as a routine
> of Cluster.pm so as it is possible to do dumps from different major
> versions.  Format should be defined in input.

SInce you are suggesting adding the new routine to Cluster.pm, I
assume that you would like to use it in many tests (ideally every test
which uses pg_dump). I did attempt this when I wrote the last version
of the patch. Code to run a pg_dump command is just a few lines. The
tests invoke pg_dump in many different ways with many different
combinations of arguments. In order to cater all those invocations,
the routine in Cluster.pm needs to be very versatile and thus complex.
It will be certainly a dozen lines at least. If such a routine would
have been useful, it would have been added to Cluster.pm already. It's
not there, because it won't be useful.

We could turn the two invocations of pg_dump for comparison (in the
patch) into a routine if that helps. It might shave a few lines of
code. Since the routine won't be general, it should reside in
002_pg_upgrade where it is used.

If you have something else in mind, please let me know.

> - Restore to a database from an input file, also as a routine of
> Cluster.pm, for the major version argument.

Similar to above, each of the pg_restore invocations are just a few
lines but there is a lot of variety in those invocations.

> - Filter of the dumps for the contents where column ordering is
> inconsistent up at restore.  In a new module.

Please note, this is filtering + adjustment. The routine is already in
a new module as you suggested earlier.

>
> I'm wondering if it would make sense to also externalize the dump
> comparison routine currently in the pg_upgrade script.
> - Comparison of two dumps, with potentially filters applied to them,
> with diff printed.  In a new module.

It is a good idea to externalize the compare_dump() function in
PostgreSQL::Test::Utils. Similar code exists in
002_compare_backups.pl. 027_stream_regress.pl also uses compare() to
compare dump files but it uses `diff` command for the same. We can
change both usages to use compare_dump().

>
> +   # Dump the original database in "plain" format for comparison later. The
> +   # order of columns in COPY statements dumped from the original database and
> [...]
> +   # Adjust the CREATE TABLE ... INHERITS statements.
> +   if ($original)
> +   {
> +       $dump =~ s/(^CREATE\sTABLE\sgenerated_stored_tests\.gtestxx_4\s\()
> +                  (\n\s+b\sinteger),
> +                  (\n\s+a\sinteger)/$1$3,$2/mgx;
>
> The reason why $original exists is documented partially in both
> 002_pg_upgrade.pl and AdjustDump.pm.  It would be better to
> consolidate that only in AdjustDump.pm, I guess.

I believe the comment in 0002_pg_upgrade.pl you quoted above and the
prologue of adjust_regress_dumpfile() are the two places you are
referring to. They serve different purposes. The one in 002_pg_upgrade
explains why we dump only schema for comparison. It is independent of
whether the dump is taken from the original database or target
database. The argument "original" to adjust_regress_dumpfile() is only
explained in the function's prologue in AdjustDump.pm. Am I missing
something?

> Isn't the name
> "$original" a bit too general when it comes to applying filters to
> the dumps to as the order of the column re-dumped is under control?
> Perhaps it would be adapted to use a hash that can be extended with
> more than one parameter to control which filters are applied?  For
> example, imagine a %params where the caller of adjust_dumpfile() can
> pass in a "filter_column_order => 1".  The filters applied to the dump
> are then self-documented.  We could do with a second for the
> whitespaces, as well.

I agree that "original" is a generic name. And I like your suggestion
partly. I will rename it as "adjust_column_order".

But I don't think we need to use a hash since filters like whitespace
are not dependent upon whether the dump is from source or target
database. IOW those filters are not optional. It will add extra
redirection unnecessarily. If in future we have to add another
adjustment which is applicable under certain conditions, we could use
a hash of switches but till then let's keep it simple.

--
Best Wishes,
Ashutosh Bapat



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Pgoutput not capturing the generated columns
Next
From: Amit Kapila
Date:
Subject: Re: Disallow UPDATE/DELETE on table with unpublished generated column as REPLICA IDENTITY