Re: regression test failed when enabling checksum - Mailing list pgsql-hackers
From | Jeff Janes |
---|---|
Subject | Re: regression test failed when enabling checksum |
Date | |
Msg-id | CAMkU=1xUza1A3WLRUTQHrP4WaOW1G2y4QK=uoac0NXv9pSk0pQ@mail.gmail.com |
In response to | Re: regression test failed when enabling checksum (Jeff Davis <pgsql@j-davis.com>) |
Responses |
Re: regression test failed when enabling checksum
regression test failed when enabling checksum |
List | pgsql-hackers |
On Tue, Mar 26, 2013 at 4:23 PM, Jeff Davis <pgsql@j-davis.com> wrote:
On Tue, 2013-03-26 at 02:50 +0900, Fujii Masao wrote:
> Hi,
>
> I found that the regression test failed when I created the database
> cluster with the checksum and set wal_level to archive. I think that
> there are some bugs around checksum feature. Attached is the regression.diff.
Thank you for the report. This was a significant oversight, but simple
to diagnose and fix.
There were several places that were doing something like:
    PageSetChecksumInplace
    if (use_wal)
        log_newpage
    smgrextend
Which is obviously wrong, because log_newpage sets the LSN of the page,
invalidating the checksum. We need to set the checksum after
log_newpage.
Also, I noticed that copy_relation_data was doing smgrread without
validating the checksum (or page header, for that matter), so I also
fixed that.
Patch attached. Only brief testing done, so I might have missed
something. I will look more closely later.
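The ordering problem described in the quoted message above can be illustrated with a toy model. This is only a sketch: the page layout, the field offsets, and the helper names (`set_lsn`, `set_checksum`, `verify`, `write_page`) are assumptions made for illustration, and CRC32 is a stand-in, not PostgreSQL's actual page checksum algorithm. It shows that a checksum computed before the LSN changes no longer matches the page that reaches disk.

```python
import struct
import zlib

PAGE_SIZE = 8192
LSN_OFFSET = 0    # toy layout: 8-byte LSN at the start of the page
CSUM_OFFSET = 8   # toy layout: 4-byte checksum right after the LSN


def compute_checksum(page) -> int:
    """Stand-in checksum: CRC32 of the page with the checksum field zeroed."""
    tmp = bytearray(page)
    tmp[CSUM_OFFSET:CSUM_OFFSET + 4] = b"\x00" * 4
    return zlib.crc32(bytes(tmp))


def set_checksum(page: bytearray) -> None:
    struct.pack_into("<I", page, CSUM_OFFSET, compute_checksum(page))


def set_lsn(page: bytearray, lsn: int) -> None:
    """Stands in for the LSN update that log_newpage performs."""
    struct.pack_into("<Q", page, LSN_OFFSET, lsn)


def verify(page) -> bool:
    (stored,) = struct.unpack_from("<I", page, CSUM_OFFSET)
    return stored == compute_checksum(page)


def write_page(page: bytearray, lsn: int, checksum_first: bool) -> bytes:
    """Return the bytes that would be handed to storage (smgrextend in the email)."""
    if checksum_first:
        set_checksum(page)   # buggy order: checksum covers the old LSN
        set_lsn(page, lsn)
    else:
        set_lsn(page, lsn)   # fixed order: LSN first, checksum last
        set_checksum(page)
    return bytes(page)


def new_page() -> bytearray:
    page = bytearray(PAGE_SIZE)
    page[100:105] = b"tuple"  # arbitrary payload
    return page


buggy = write_page(new_page(), 0x12345, checksum_first=True)
fixed = write_page(new_page(), 0x12345, checksum_first=False)
print("buggy order verifies:", verify(buggy))
print("fixed order verifies:", verify(fixed))
```

In this model, any write to the page after the checksum is computed has the same effect, which is why the checksum step has to come last before the page is handed to storage.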
After applying your patch, I could run the stress test described here:
But altered to make use of initdb -k, of course.
Over 10,000 cycles of crash and recovery, I encountered two cases of checksum failures after recovery, for example:
14264 SELECT 2013-03-28 13:08:38.980 PDT:WARNING: page verification failed, calculated checksum 7017 but expected 1098
14264 SELECT 2013-03-28 13:08:38.980 PDT:ERROR: invalid page in block 77 of relation base/16384/2088965
14264 SELECT 2013-03-28 13:08:38.980 PDT:STATEMENT: select sum(count) from foo
In both cases, the bad block (77 in this case) is the same block that was intentionally partially-written during the "crash". However, that block should have been restored from the WAL FPW, so its fragmented nature should not have been present in order to be detected. Any idea what is going on?
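The expectation in the paragraph above, that a torn block gets overwritten by the full-page image from WAL and so should never be seen by a reader, can be sketched as a toy model. Everything here is an assumption for illustration: the sector size, the page layout, and the helpers `make_page`, `torn_write`, and `recover` are hypothetical, CRC32 stands in for the real checksum, and the "recovery" is a one-line caricature rather than PostgreSQL's redo logic.

```python
import struct
import zlib

PAGE_SIZE = 8192
SECTOR = 512  # assumed unit in which a write can be torn


def make_page(lsn: int, count: int) -> bytes:
    """Toy page: LSN in the first sector, a counter value in the last sector."""
    page = bytearray(PAGE_SIZE)
    struct.pack_into("<Q", page, 0, lsn)
    struct.pack_into("<I", page, PAGE_SIZE - 4, count)
    return bytes(page)


def torn_write(old: bytes, new: bytes, sectors_written: int) -> bytes:
    """Simulate a crash mid-write: only the leading sectors of `new` reach disk."""
    cut = sectors_written * SECTOR
    return new[:cut] + old[cut:]


def recover(on_disk: bytes, full_page_image):
    """Toy redo: a full-page image, if present, replaces the on-disk block."""
    return full_page_image if full_page_image is not None else on_disk


old = make_page(lsn=1, count=10)
new = make_page(lsn=2, count=11)
expected = zlib.crc32(new)  # checksum the intact new page would carry

torn = torn_write(old, new, sectors_written=1)  # new LSN, stale counter
with_fpw = recover(torn, full_page_image=new)
without_fpw = recover(torn, full_page_image=None)

print("torn block matches checksum:", zlib.crc32(torn) == expected)
print("after FPW restore matches:", zlib.crc32(with_fpw) == expected)
print("without FPW restore matches:", zlib.crc32(without_fpw) == expected)
```

In this model a checksum failure after recovery can only happen if the full-page image was not applied to the block, or if the block was modified again afterwards; the sketch does not say which of those (if either) happened in the test run.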
Unfortunately I already cleaned up the data directory before noticing the problem, so I have nothing to post for forensic analysis. I'll try to reproduce the problem.
Without the initdb -k option, I ran it for 30,000 cycles and found no problems. I don't think this is because the problem exists but is going undetected, because my test is designed to detect such problems--if the block is fragmented but not overwritten by WAL FPW, that should occasionally lead to detectable inconsistent tuples.
I don't think your patch caused this particular problem; it merely fixed a different problem that was previously preventing me from running my test.
Cheers,
Jeff