Re: [CORE] Restore-reliability mode - Mailing list pgsql-hackers

From Craig Ringer
Subject Re: [CORE] Restore-reliability mode
Date
Msg-id CAMsr+YEA_YwGMTeG-zGDNL1_RwxN4dJYQ9xQfg8np3CoF4bQ1A@mail.gmail.com
In response to Re: [CORE] Restore-reliability mode  (Stephen Frost <sfrost@snowman.net>)
Responses Re: [CORE] Restore-reliability mode  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-hackers


On 4 June 2015 at 22:43, Stephen Frost <sfrost@snowman.net> wrote:
> Josh,
>
> * Josh Berkus (josh@agliodbs.com) wrote:
> > I would argue that if we delay 9.5 in order to do a 100% manual review
> > of code, without adding any new automated tests or other non-manual
> > tools for improving stability, then it's a waste of time; we might as
> > well just release the beta, and our users will find more issues than we
> > will.  I am concerned that if we declare a cleanup period, especially in
> > the middle of the summer, all that will happen is that the project will
> > go to sleep for an extra three months.
>
> This is the exact same concern that I have.  A delay just to have a
> delay is not useful.  I completely agree that we need more automated
> testing, etc, though getting all of that set up and running could be
> done at any time too- there's no reason to wait, nor do I believe
> delaying 9.5 would make such automated testing appear.


In terms of specific testing improvements, the things I think we need covered and runnable on the buildfarm are:

* pg_dump and pg_restore testing (because it's scary we don't do this)
* WAL archiving based warm standby testing with promotion
* Two node streaming replication with promotion, both with a slot and with archive fallback
* Three node cascading streaming replication with middle node promotion then tail end node promotion
* Logical decoding streaming testing, comparing to expected decoded output
* DDL deparse test coverage for all operations
* pg_basebackup + start up from backup
* Hard-kill the postmaster, start up from crashed datadir
* pg_start_backup, rsync, pg_stop_backup, start up in hot standby
* Disk exhaustion tests for both pg_xlog and the main datadir, showing that we recover OK once the disk has filled and space is then freed
* Tests of crash recovery during various DDL operations

Obviously some of these overlap, so one test can cover more than one item.
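As an illustration of the "hard-kill the postmaster" item, here is a minimal Python sketch of the crash step. It relies only on the fact that the first line of postmaster.pid in the data directory holds the postmaster's PID. The helper names (read_postmaster_pid, hard_kill_postmaster) and the injectable kill argument are hypothetical, made up for this example; a real test would follow up by starting the server again on the same datadir and checking that crash recovery completes.

```python
import os
import signal
from pathlib import Path


def read_postmaster_pid(datadir):
    """Return the postmaster PID recorded on the first line of postmaster.pid."""
    pid_file = Path(datadir) / "postmaster.pid"
    first_line = pid_file.read_text().splitlines()[0]
    return int(first_line.strip())


def hard_kill_postmaster(datadir, kill=os.kill):
    """Simulate a crash by sending SIGKILL to the postmaster of `datadir`.

    `kill` is injectable (hypothetical convenience) so the helper can be
    exercised without a running server. Returns the PID that was signalled.
    """
    pid = read_postmaster_pid(datadir)
    kill(pid, signal.SIGKILL)
    return pid
```

Because SIGKILL gives the postmaster no chance to clean up, postmaster.pid is left behind, which is part of what the subsequent start-up has to cope with.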

Implementing these requires stepping outside the comfort zone of pg_regress and the isolationtester and having something that can manage multiple data directories. It's also hard to be sure you're testing the same thing each time. For example, when using streaming replication with archive fallback, it might be tricky to ensure that your replica falls behind and falls back to the WAL archive on every run. There's always SIGSTOP, I guess.
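The SIGSTOP idea could be sketched roughly as below: freeze a process for the duration of a block, then always resume it with SIGCONT. Which process to pause (for instance the standby's WAL receiver) and how its PID is found are left open here; the context manager name paused and the injectable kill argument are assumptions for illustration, not an existing tool.

```python
import os
import signal
from contextlib import contextmanager


@contextmanager
def paused(pid, kill=os.kill):
    """Freeze process `pid` with SIGSTOP while the with-block runs.

    SIGCONT is sent in a finally clause so the process is resumed even if
    the block raises. `kill` is injectable (hypothetical) for dry runs.
    """
    kill(pid, signal.SIGSTOP)
    try:
        yield pid
    finally:
        kill(pid, signal.SIGCONT)
```

A test would generate WAL on the upstream node inside the with-block, so that the frozen replica is guaranteed to be behind when it resumes. Whether that reliably forces the fallback to the archive still depends on WAL retention settings on the upstream.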

While these are multi-node tests, at least in PostgreSQL we can just run on different ports, so there's no need to muck about with containers or VMs.
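To make the "different ports" point concrete, here is a small Python sketch that picks distinct free ports and builds, without running them, the initdb and pg_ctl command lines for each node. The function names and the directory layout (one subdirectory per node) are assumptions for this example; only the standard initdb -D and pg_ctl -D/-o/-w/-m options are used.

```python
import socket
from pathlib import Path


def free_ports(count):
    """Return `count` distinct TCP ports that were free on 127.0.0.1.

    All sockets are held open until every port is chosen, so the ports are
    distinct. Another process could still grab one before the server binds.
    """
    sockets = []
    try:
        for _ in range(count):
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            sockets.append(s)
            s.bind(("127.0.0.1", 0))  # port 0: let the OS choose
        return [s.getsockname()[1] for s in sockets]
    finally:
        for s in sockets:
            s.close()


def node_commands(base_dir, name, port):
    """Build (but do not run) the commands for one node's lifecycle."""
    datadir = str(Path(base_dir) / name)
    return {
        "port": port,
        "initdb": ["initdb", "-D", datadir],
        "start": ["pg_ctl", "-D", datadir, "-o", f"-p {port}", "-w", "start"],
        "stop": ["pg_ctl", "-D", datadir, "-m", "fast", "-w", "stop"],
    }


def plan_cluster(base_dir, names):
    """Assign one data directory and one port to each named node."""
    ports = free_ports(len(names))
    return {n: node_commands(base_dir, n, p) for n, p in zip(names, ports)}
```

A harness could pass each command list to subprocess.run; replicas would more likely be created with pg_basebackup from the first node than with initdb.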

I already run some of these tests using Ansible for BDR, but I don't imagine that'd be acceptable in core. It's Python, and it's not especially well suited to use as a regression testing framework; it's just what I had to hand and already needed for other automation tasks.

Is pg_tap a reasonable starting point for this sort of testing?

Am I missing obvious and important tests?

What would a test that would've caught the multixact issues look like?

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
