Re: regression test failed when enabling checksum - Mailing list pgsql-hackers

From	Jeff Janes
Subject	Re: regression test failed when enabling checksum
Date	April 1, 2013 17:37:55
Msg-id	CAMkU=1xUza1A3WLRUTQHrP4WaOW1G2y4QK=uoac0NXv9pSk0pQ@mail.gmail.com
In response to	Re: regression test failed when enabling checksum (Jeff Davis <pgsql@j-davis.com>)
Responses	Re: regression test failed when enabling checksum
List	pgsql-hackers

On Tue, Mar 26, 2013 at 4:23 PM, Jeff Davis <pgsql@j-davis.com> wrote:

On Tue, 2013-03-26 at 02:50 +0900, Fujii Masao wrote:
> Hi,
>
> I found that the regression test failed when I created the database
> cluster with the checksum and set wal_level to archive. I think that
> there are some bugs around checksum feature. Attached is the regression.diff.

Thank you for the report. This was a significant oversight, but simple
to diagnose and fix.

There were several places that were doing something like:

    PageSetChecksumInplace
    if (use_wal)
        log_newpage
    smgrextend

Which is obviously wrong, because log_newpage sets the LSN of the page,
invalidating the checksum. We need to set the checksum after
log_newpage.
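
The ordering problem described above can be illustrated with a toy model. This is only a sketch in Python, not PostgreSQL code: the `Page` class, the SHA-256 stand-in checksum, and the `extend_buggy` / `extend_fixed` helpers are all hypothetical names invented for illustration, and the only assumption carried over from the message is that the checksum covers the page's LSN.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Page:
    """Toy page whose checksum covers both the LSN and the payload."""
    data: bytes
    lsn: int = 0
    checksum: bytes = b""


def compute_checksum(page: Page) -> bytes:
    # Stand-in checksum (SHA-256); PostgreSQL's real page checksum differs.
    return hashlib.sha256(page.lsn.to_bytes(8, "little") + page.data).digest()


def set_checksum(page: Page) -> None:
    page.checksum = compute_checksum(page)


def verify(page: Page) -> bool:
    return page.checksum == compute_checksum(page)


def log_newpage(page: Page, wal: list) -> None:
    # Toy WAL logging: record the page image, then stamp the page with a new LSN.
    wal.append(page.data)
    page.lsn = len(wal)


def extend_buggy(page: Page, wal: list, use_wal: bool) -> Page:
    # Order from the report: checksum first, then WAL logging.
    set_checksum(page)
    if use_wal:
        log_newpage(page, wal)  # changes the LSN, so the stored checksum is stale
    return page  # stand-in for writing the page out


def extend_fixed(page: Page, wal: list, use_wal: bool) -> Page:
    # Corrected order: WAL logging first, checksum last.
    if use_wal:
        log_newpage(page, wal)
    set_checksum(page)
    return page  # stand-in for writing the page out
```

In this model, a page passed through `extend_buggy` with `use_wal` enabled fails `verify`, while the same page passed through `extend_fixed` passes. With `use_wal` disabled both orders agree, which would be consistent with the failure only showing up once wal_level was set to archive.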

Also, I noticed that copy_relation_data was doing smgrread without
validating the checksum (or page header, for that matter), so I also
fixed that.

Patch attached. Only brief testing done, so I might have missed
something. I will look more closely later.

After applying your patch, I could run the stress test described here:

http://archives.postgresql.org/pgsql-hackers/2012-02/msg01227.php

But altered to make use of initdb -k, of course.

Over 10,000 cycles of crash and recovery, I encountered two cases of checksum failures after recovery. For example:

14264 SELECT 2013-03-28 13:08:38.980 PDT:WARNING: page verification failed, calculated checksum 7017 but expected 1098

14264 SELECT 2013-03-28 13:08:38.980 PDT:ERROR: invalid page in block 77 of relation base/16384/2088965

14264 SELECT 2013-03-28 13:08:38.980 PDT:STATEMENT: select sum(count) from foo

In both cases, the bad block (77 in this case) is the same block that was intentionally partially written during the "crash". However, that block should have been restored from the full-page write (FPW) in the WAL, so the torn page should no longer have been there to detect. Any idea what is going on?
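
To make the expectation concrete, here is a minimal sketch of a torn write followed by full-page-image replay. It is a toy model under stated assumptions, not the actual test harness or PostgreSQL's recovery code: the block layout (a 32-byte SHA-256 digest followed by the payload), the tear point at half a block, and the names `make_block`, `torn_write`, and `replay_fpw` are all hypothetical.

```python
import hashlib

BLOCK_SIZE = 8192
DIGEST_SIZE = 32  # SHA-256 stand-in; the real page checksum is much smaller


def make_block(fill: bytes) -> bytes:
    """Build a toy block: digest of the payload, followed by the payload."""
    payload = fill * (BLOCK_SIZE - DIGEST_SIZE)  # fill is a single byte
    return hashlib.sha256(payload).digest() + payload


def is_valid(block: bytes) -> bool:
    stored, payload = block[:DIGEST_SIZE], block[DIGEST_SIZE:]
    return stored == hashlib.sha256(payload).digest()


def torn_write(old: bytes, new: bytes, cut: int = BLOCK_SIZE // 2) -> bytes:
    # Simulate a crash mid-write: the first part is new, the rest is still old.
    return new[:cut] + old[cut:]


def replay_fpw(block_on_disk: bytes, full_page_image: bytes) -> bytes:
    # Replaying a full-page image overwrites the whole block unconditionally,
    # so whatever was on disk (torn or not) is discarded.
    return full_page_image
```

In this model the torn block fails `is_valid`, and the block returned by `replay_fpw` passes it. That is the behavior I would expect after recovery, which is why a checksum failure on the deliberately torn block is surprising.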

Unfortunately I already cleaned up the data directory before noticing the problem, so I have nothing to post for forensic analysis. I'll try to reproduce the problem.

Without the initdb -k option, I ran it for 30,000 cycles and found no problems. I don't think the problem exists there and is merely going undetected, because my test is designed to detect such problems: if the block is torn but not overwritten by the WAL FPW, that should occasionally lead to detectably inconsistent tuples.

I don't think your patch caused this particular problem; it merely fixed a problem that was previously preventing me from running my test.

Cheers,

Jeff
