Re: Test to dump and restore objects left behind by regression - Mailing list pgsql-hackers
From: Ashutosh Bapat
Subject: Re: Test to dump and restore objects left behind by regression
Msg-id: CAExHW5uvx2LEyrUBdctV5gS25Zeb+-eXESkK93siQxWSjYFy6A@mail.gmail.com
In response to: Re: Test to dump and restore objects left behind by regression (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>)
List: pgsql-hackers
Hi Tom and Michael,

Thanks for your inputs. I am replying to all the comments in a single email, arranging related comments together.

On Thu, Oct 31, 2024 at 11:26 AM Michael Paquier <michael@paquier.xyz> wrote:
> On my laptop, testing the plain format adds roughly 12s, in a test
> that now takes 1m20s to run vs 1m32s. Enabling regress_dump_formats
> and adding three more formats counts for 45s of runtime. For a test
> that usually shows up as the last one to finish for a heavily
> parallelized run. So even the default of "plain" is going to be
> noticeable, I am afraid.
>
> On Thu, Oct 31, 2024 at 10:26:01AM -0400, Tom Lane wrote:
> > I'd be okay with adding it in a form where the default behavior
> > is to do no additional checking.

If I run the test alone, it takes 45s (master) vs 54s (with patch) on my laptop. These readings are similar to what you have observed. The restore step by itself takes most of the time, even if a. we eliminate data, b. use formats other than plain, or c. use --jobs=2. Hence I am fine with Tom's suggestion, i.e. the default behaviour is to do no additional testing. I propose to test all dump formats (including plain) only when PG_TEST_EXTRA contains "regress_dump_tests". But see the next point.

> What's the advantage of testing all the formats? Would that stuff
> have been able to catch up more issues related to specific format(s)
> when it came to the compression improvements with inheritance?

I haven't caught any more issues with formats other than "plain". It is more for future-proofing the test. I am fine if we want to test just the plain dump format for now. Adding more formats would be easier if required.

> > Whether that's worth maintaining is hard to say though.
>
> In terms of maintenance, it would be nice if we are able to minimize
> the code added to the pg_upgrade suite, so as it would be simple to
> switch this code elsewhere if need be.

I think Tom hints at maintenance of AdjustDump::adjust_dump_for_restore().
In future, if the difference between the dump from the original database and that from the restored database grows, we will need to update AdjustDump::adjust_dump_for_restore() accordingly. That will be some maintenance. But the person introducing such a change will get a chance to fix it if it is unintentional. That balances out any maintenance effort, I think.

> + test_regression_dump_restore($oldnode, %node_params);
>
> Why is this only done for the main regression test suite? Perhaps it
> could be useful as well for tests that want to check after their own
> custom dumps, as a shortcut?
>
> Linked to that. Could there be some use in being able to pass down a
> list of databases to this routine, rather than being limited only to
> "regression"? Think extension databases with USE_MODULE_DB that have
> unique names.

I did think of this when implementing the function. In order to test custom dumps or extensions, adjust_regress_dumpfile() would need to be extensible, or the test would need a way to accept a custom dump file for comparison. Without a concrete use case, adding the customization hooks might go wrong and need rework. test_regression_dump_restore() itself is isolated enough that we can extend it when the need arises; at that point we will know what needs to be extensible, and how. If you have a specific use case, please let me know and I will accommodate it in my patch.

> Perhaps we
> should be more ambitious and move more logic into AdjustDump.pm? If
> we think that the full cycle of dump -> restore -> dump -> compare
> could be used elsewhere, this would limit the footprint of what we are
> doing in the pg_upgrade script in this patch and be able to do similar
> stuff in out-of-core extensions or other tests. Let's say a
> PostgreSQL::Test::Dump.pm?

The dump->restore->dump->compare pattern is seen only in the 002_pg_upgrade test. 002_compare_backups compares dumps from servers but does not use the dump->restore->dump->compare pattern.
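For reference, the cycle being discussed boils down to something like the sketch below, written against the usual TAP test infrastructure. This is an illustrative simplification, not the patch's actual code: the node name, file names, and the assumption that a populated "regression" database already exists are all mine.

```perl
# Sketch of dump -> restore -> dump -> compare, assuming a cluster that
# already contains a "regression" database populated by the regression tests.
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use File::Compare qw(compare);
use Test::More;

my $tempdir = PostgreSQL::Test::Utils::tempdir;
my ($dump1, $dump2) = ("$tempdir/dump1.sql", "$tempdir/dump2.sql");

# 1. Dump the original database (schema only, plain format).
$node->command_ok(
	[ 'pg_dump', '--schema-only', '--no-sync', '-f', $dump1,
	  $node->connstr('regression') ],
	'dump original database');

# 2. Restore the dump into a fresh database.
$node->safe_psql('postgres', 'CREATE DATABASE regression_restored');
$node->command_ok(
	[ 'psql', '-X', '-f', $dump1, $node->connstr('regression_restored') ],
	'restore dump');

# 3. Dump the restored database the same way.
$node->command_ok(
	[ 'pg_dump', '--schema-only', '--no-sync', '-f', $dump2,
	  $node->connstr('regression_restored') ],
	'dump restored database');

# 4. After adjusting both files (not shown), expect no difference.
is(compare($dump1, $dump2), 0, 'dumps of original and restored DBs match');
```

In the patch the two dumps are passed through AdjustDump before the comparison; the sketch elides that step.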
If a similar pattern starts appearing in multiple places, we can easily move test_regression_dump_restore() to a common module to avoid code duplication. The function is isolated enough for that purpose.

> - Dump of a database into an output file given in input, as a routine
> of Cluster.pm so as it is possible to do dumps from different major
> versions. Format should be defined in input.

Since you are suggesting adding the new routine to Cluster.pm, I assume that you would like to use it in many tests (ideally every test which uses pg_dump). I did attempt this when I wrote the last version of the patch. The code to run a pg_dump command is just a few lines, but the tests invoke pg_dump in many different ways, with many different combinations of arguments. In order to cater to all those invocations, the routine in Cluster.pm would need to be very versatile, and thus complex: certainly a dozen lines at least. If such a routine were useful, it would have been added to Cluster.pm already; it is not there because it would not be useful. We could turn the two invocations of pg_dump for comparison (in the patch) into a routine if that helps. It might shave a few lines of code. Since the routine would not be general, it should reside in 002_pg_upgrade where it is used. If you have something else in mind, please let me know.

> - Restore to a database from an input file, also as a routine of
> Cluster.pm, for the major version argument.

Similar to the above: each pg_restore invocation is just a few lines, but there is a lot of variety in those invocations.

> - Filter of the dumps for the contents where column ordering is
> inconsistent up at restore. In a new module.

Please note that this is filtering + adjustment. The routine is already in a new module, as you suggested earlier.

> I'm wondering if it would make sense to also externalize the dump
> comparison routine currently in the pg_upgrade script.
>
> - Comparison of two dumps, with potentially filters applied to them,
> with diff printed. In a new module.

It is a good idea to externalize the compare_dump() function in PostgreSQL::Test::Utils. Similar code exists in 002_compare_backups.pl. 027_stream_regress.pl also uses compare() to compare dump files, but it uses the `diff` command for the same purpose. We can change both usages to use compare_dump().

> + # Dump the original database in "plain" format for comparison later. The
> + # order of columns in COPY statements dumped from the original database and
> [...]
> + # Adjust the CREATE TABLE ... INHERITS statements.
> + if ($original)
> + {
> +     $dump =~ s/(^CREATE\sTABLE\sgenerated_stored_tests\.gtestxx_4\s\()
> +         (\n\s+b\sinteger),
> +         (\n\s+a\sinteger)/$1$3,$2/mgx;
>
> The reason why $original exists is documented partially in both
> 002_pg_upgrade.pl and AdjustDump.pm. It would be better to
> consolidate that only in AdjustDump.pm, I guess.

I believe the comment in 002_pg_upgrade.pl you quoted above and the prologue of adjust_regress_dumpfile() are the two places you are referring to. They serve different purposes. The one in 002_pg_upgrade.pl explains why we dump only the schema for comparison; that is independent of whether the dump is taken from the original database or the target database. The argument "original" to adjust_regress_dumpfile() is explained only in the function's prologue in AdjustDump.pm. Am I missing something?

> Isn't the name
> "$original" a bit too general when it comes to applying filters to
> the dumps to as the order of the column re-dumped is under control?
> Perhaps it would be adapted to use a hash that can be extended with
> more than one parameter to control which filters are applied? For
> example, imagine a %params where the caller of adjust_dumpfile() can
> pass in a "filter_column_order => 1". The filters applied to the dump
> are then self-documented. We could do with a second for the
> whitespaces, as well.
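For concreteness, the %params interface suggested above would look roughly like this. This is a hypothetical sketch following Michael's naming (adjust_dumpfile, filter_column_order), not code from the patch; the whitespace cleanup and the regex are only examples.

```perl
# Hypothetical sketch of a %params-driven adjustment routine.
# Routine and parameter names follow Michael's suggestion, not the patch.
sub adjust_dumpfile
{
	my ($dump, %params) = @_;

	# Unconditional cleanups, e.g. collapse runs of blank lines.
	$dump =~ s/\n{3,}/\n\n/g;

	# Optional filters, enabled individually by each caller.
	if ($params{filter_column_order})
	{
		# Reorder columns known to differ between the original and
		# the restored dump (example pattern only).
		$dump =~ s/(^CREATE\sTABLE\sgenerated_stored_tests\.gtestxx_4\s\()
			(\n\s+b\sinteger),
			(\n\s+a\sinteger)/$1$3,$2/mgx;
	}
	return $dump;
}

# A caller dumping the original database would then write:
my $adjusted = adjust_dumpfile($dump, filter_column_order => 1);
```
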
I agree that "original" is a generic name, and I partly like your suggestion: I will rename the argument "adjust_column_order". But I don't think we need a hash, since filters like the whitespace one do not depend on whether the dump is from the source or the target database; IOW, those filters are not optional, and a hash would add an extra level of indirection unnecessarily. If in future we add another adjustment that applies only under certain conditions, we can switch to a hash of switches, but until then let's keep it simple.

--
Best Wishes,
Ashutosh Bapat