Re: regression test failed when enabling checksum - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: regression test failed when enabling checksum
Msg-id CAMkU=1x=261iP1rJz8Z1YJBqnnNUGtJ9yMUaLcQqxKkVKu8iDg@mail.gmail.com
In response to Re: regression test failed when enabling checksum  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
On Wed, Apr 3, 2013 at 2:31 AM, Andres Freund <andres@2ndquadrant.com> wrote:


I just checked and unfortunately your dump doesn't contain all that much
valid WAL:
...
 
So just two checkpoint records.

Unfortunately I fear that won't be enough to diagnose the problem,
could you reproduce it with a higher wal_keep_segments?

I've been trying, but see the message "commit dfda6ebaec67 versus wal_keep_segments".


Looking at some of the log files more, I see that vacuum is involved, but in some way I don't understand.  The crash always happens on a test cycle immediately after the sleep that allows the autovac to kick in and finish.  So the sequence of events goes something like this:

...
run the frantic updating of "foo" until crash
recovery
query "foo" and verify the results are consistent with expectations
sleep to allow autovac to do its job.
truncate "foo" and repopulate it.
run the frantic updating of "foo" until crash
recovery
attempt to query "foo" but get the checksum failure.
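
For readers who want the ordering spelled out, the cycle above can be sketched roughly as follows. This is only an illustration, not the actual harness: every name here (run_harness, ChecksumFailure, and the five callables) is hypothetical, and where one cycle ends and the next begins is an assumption, since the list above starts mid-stream.

```python
class ChecksumFailure(Exception):
    """Raised by the (hypothetical) verify step when a query hits a bad page."""


def run_harness(populate, update_until_crash, recover, verify,
                wait_for_autovac, max_cycles):
    """Run the test cycle; return the 1-based cycle that failed, or None.

    All five callables are placeholders supplied by the caller. The order
    of the steps follows the list in the message; nothing else is implied
    about how the real harness is written.
    """
    for cycle in range(1, max_cycles + 1):
        populate()             # truncate "foo" and repopulate it
        update_until_crash()   # frantic updating of "foo" until crash
        recover()              # crash recovery
        try:
            verify()           # query "foo", check results are as expected
        except ChecksumFailure:
            return cycle       # the failure shows up right after recovery
        wait_for_autovac()     # sleep so autovac can kick in and finish
    return None
```

Written this way, the observation is that the failing verify always belongs to the cycle that follows a completed wait_for_autovac step.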

What the vacuum is doing that corrupts the system in a way that survives the truncate is a mystery to me.

Also, at one point I had the harness itself exit as soon as it detected the problem, but I failed to have it shut down the server.  So the server kept running idle with autovac doing its thing, which produced some interesting log output:

WARNING:  relation "foo" page 45 is uninitialized --- fixing
WARNING:  relation "foo" page 46 is uninitialized --- fixing
...
WARNING:  relation "foo" page 72 is uninitialized --- fixing
WARNING:  relation "foo" page 73 is uninitialized --- fixing
WARNING:  page verification failed, calculated checksum 54570 but expected 34212
ERROR:  invalid page in block 74 of relation base/16384/4931589

This happened 3 times.  Every time, the warnings started on page 45, and they continued up until the invalid page was found (which varied, being block 74, 86, and 74 again).
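
To compare runs like these without reading the logs by eye, something along these lines could pull out the first and last "uninitialized" page and the block that failed verification. It is a rough sketch: the regular expressions are written only against the message formats shown above, the function name summarize is made up, and the sample text is the excerpt from this message with the elided lines left out.

```python
import re

# Patterns match only the two message shapes quoted in this email.
UNINIT_RE = re.compile(
    r'^WARNING:\s+relation "(?P<rel>[^"]+)" page (?P<page>\d+) '
    r'is uninitialized --- fixing$'
)
INVALID_RE = re.compile(
    r'^ERROR:\s+invalid page in block (?P<block>\d+) of relation (?P<path>\S+)$'
)

SAMPLE = '''\
WARNING:  relation "foo" page 45 is uninitialized --- fixing
WARNING:  relation "foo" page 46 is uninitialized --- fixing
WARNING:  relation "foo" page 72 is uninitialized --- fixing
WARNING:  relation "foo" page 73 is uninitialized --- fixing
WARNING:  page verification failed, calculated checksum 54570 but expected 34212
ERROR:  invalid page in block 74 of relation base/16384/4931589
'''


def summarize(log_text):
    """Return first/last uninitialized page and the first invalid block."""
    pages = []
    invalid_block = None
    for line in log_text.splitlines():
        line = line.strip()
        m = UNINIT_RE.match(line)
        if m:
            pages.append(int(m.group("page")))
            continue
        m = INVALID_RE.match(line)
        if m:
            invalid_block = int(m.group("block"))
            break  # stop at the first invalid page, as vacuum did
    return {
        "first_uninitialized": pages[0] if pages else None,
        "last_uninitialized": pages[-1] if pages else None,
        "invalid_block": invalid_block,
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

On the excerpt above this reports 45 as the first uninitialized page, 73 as the last, and 74 as the invalid block.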

I wonder if the bug is in checksums, or if the checksums are doing their job by finding some other bug.  And why did those uninitialized pages trigger warnings when they were autovacced, but not when they were seq scanned in a query?

Cheers,

Jeff
