Re: [pgsql-hackers] Daily digest v1.9418 (15 messages) - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: [pgsql-hackers] Daily digest v1.9418 (15 messages)
Date
Msg-id f67928030908270947h10862c70h9e3a4c59ab21f337@mail.gmail.com
Responses Re: [pgsql-hackers] Daily digest v1.9418 (15 messages)  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
---------- Forwarded message ----------
From: Tom Lane <tgl@sss.pgh.pa.us>
To: Robert Haas <robertmhaas@gmail.com>
Date: Thu, 27 Aug 2009 10:11:24 -0400
Subject: Re: 8.5 release timetable, again

What I'd like to see is some sort of test mechanism for WAL recovery.
What I've done sometimes in the past (and recently had to fix the tests
to re-enable) is to kill -9 a backend immediately after running the
regression tests, let the system replay the WAL for the tests, and then
take a pg_dump and compare that to the dump gotten after a conventional
run.  However this is quite haphazard since (a) the regression tests
aren't especially designed to exercise all of the WAL logic, and (b)
pg_dump might not show the effects of some problems, particularly not
corruption in non-system indexes.  It would be worth the trouble to
create a more specific test methodology.
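
A minimal sketch of the comparison step described above, assuming the two pg_dump outputs have already been captured as text (one after a conventional run, one after crash and WAL replay). The function name diff_dumps and the labels are hypothetical, chosen only for illustration. As the message notes, an empty diff does not prove recovery was correct, since pg_dump will not show index corruption.

```python
import difflib


def diff_dumps(baseline_sql, recovered_sql):
    """Return unified-diff lines between two dump texts (empty list if identical).

    baseline_sql:  dump taken after a conventional regression run
    recovered_sql: dump taken after kill -9 and WAL replay
    Both arguments are plain strings; how they are produced is out of scope here.
    """
    return list(
        difflib.unified_diff(
            baseline_sql.splitlines(),
            recovered_sql.splitlines(),
            fromfile="baseline",
            tofile="after_recovery",
            lineterm="",
        )
    )
```

Any non-empty result would be a reason to look closer; an empty one only says that the logical contents pg_dump can see are the same.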

I hacked mdwrite so that it had a static int counter.  When the counter hit 400, and if the guc_of_death was set, it would write out a partial block (to simulate a partial page write) and then PANIC.  I have some Perl code that runs against the database doing a bunch of updates until the database dies.  Then, once it can reconnect, it makes sure the data reflects what Perl thinks it should.  This is how I (belatedly) found and tracked down the bug in the visibility bit.  (What I was trying to do was determine whether my toying around with XLogInsert was breaking anything.  Since the regression suite wouldn't show me a problem if one existed, I came up with this.  Then I found things were broken even before I started toying with it...)
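
To make the counter-triggered torn write concrete, here is a toy model in Python. It is not the PostgreSQL patch and not the Perl harness: FaultyWriter, SimulatedPanic and run_until_panic are hypothetical names, the "relation" is an ordinary file, and the 8192-byte block size is just PostgreSQL's default. The sketch only shows the mechanism: count the writes, and on the 400th, if the switch is set, write half a block and abort.

```python
import os
import tempfile

BLOCK_SIZE = 8192       # PostgreSQL's default page size; an assumption of this model
CRASH_AT_WRITE = 400    # the counter threshold mentioned in the message


class SimulatedPanic(Exception):
    """Stands in for a PANIC in this toy model."""


class FaultyWriter:
    def __init__(self, path, guc_of_death=False):
        self.path = path
        self.guc_of_death = guc_of_death
        self.counter = 0                 # plays the role of the static int counter
        open(path, "ab").close()         # create the file if missing, never truncate

    def write_block(self, blockno, data):
        assert len(data) == BLOCK_SIZE
        self.counter += 1
        torn = self.guc_of_death and self.counter == CRASH_AT_WRITE
        # On the fatal write, only the first half of the block reaches the file.
        payload = data[: BLOCK_SIZE // 2] if torn else data
        with open(self.path, "r+b") as f:
            f.seek(blockno * BLOCK_SIZE)
            f.write(payload)
        if torn:
            raise SimulatedPanic(f"torn write of block {blockno}")


def run_until_panic(path):
    """Overwrite block 0 with a new 'version' each time until the writer dies.

    Returns (last version attempted, bytes of block 0 left on disk).
    """
    writer = FaultyWriter(path, guc_of_death=True)
    version = 0
    try:
        while True:
            version += 1
            writer.write_block(0, bytes([version % 256]) * BLOCK_SIZE)
    except SimulatedPanic:
        pass
    with open(path, "rb") as f:
        return version, f.read(BLOCK_SIZE)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        version, block = run_until_panic(os.path.join(tmp, "rel"))
        print("died on write", version, "distinct bytes on page:", sorted(set(block)))
```

After the simulated PANIC the page holds the new version in its first half and the previous version in its second half. That mixed state is what crash recovery has to repair, and it is what a client-side checker then compares against its own record of the updates.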

I don't know how lucky I was to hit upon a test that found an already existing bug.  I have to assume I was somewhat lucky, simply because it took a run of many hours or overnight (with a simulated crash every 2 minutes or so) to reliably detect the problem.  But how do you turn something like this into a regression test?  Scattering the code with intentional crash-inducing code that is there only to exercise the error recovery parts seems like it would be quite a mess.

 
In short: merely making the tests bigger doesn't impress me in the
least.  Focused testing on areas we aren't covering at all could be
worth the trouble.

Do you have suggestions on what other areas need it?
 
Jeff
